MANUAL: apt-probeset-summarize (1.17.0)

Introduction

apt-probeset-summarize is a program for doing background subtraction, normalization and summarizing probe sets from Affymetrix expression microarrays. It implements analysis algorithms such as RMA, Plier, and DABG (detected above background).

The main features of apt-probeset-summarize not common in other implementations are:

Quick Start

Most users will just want to generate summaries using RMA and/or Plier for each probeset on the microarray. We provide both 'rma' and 'rma-sketch' where 'rma-sketch' will closely approximate a full quantile normalization using a much smaller amount of memory.

On Unix systems, a command to run both the rma-sketch and plier-mm-sketch analyses at the same time with the default parameters looks like:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -d chip.cdf -o output-dir *.cel

The command above uses a CDF file. Alternatively, a PGF file and a CLF file can be specified:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

As the Windows command prompt does not natively support wildcard expansion, the preferred method is to supply a text file list via the --cel-files option (see below for details of the file format). On Windows, a command using the default parameters looks like:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -d chip.cdf -o output-dir --cel-files cel_list.txt

Here -a specifies an analysis to run and -o specifies the directory for the output files. You can specify the probesets on a chip with either a CDF file via the -d flag or a PGF/CLF file pair via the -p and -c flags.
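The list file passed to --cel-files can be generated with a short script. The following is a minimal Python sketch, assuming only the format stated in the Options section (first line 'cel_files', then one cel file per line); the function names build_cel_list and write_cel_list are hypothetical helpers and are not part of APT.

```python
from pathlib import Path


def build_cel_list(cel_paths):
    """Return the text of a --cel-files list: header line, then one path per line."""
    lines = ["cel_files"] + [str(p) for p in cel_paths]
    return "\n".join(lines) + "\n"


def write_cel_list(cel_dir, list_path):
    """Collect the .cel/.CEL files in cel_dir and write them to list_path.

    Hypothetical helper: returns the number of cel files listed.
    """
    cels = sorted(p for p in Path(cel_dir).iterdir() if p.suffix.lower() == ".cel")
    Path(list_path).write_text(build_cel_list(cels))
    return len(cels)
```

Calling write_cel_list("data", "cel_list.txt") would produce a file that can then be supplied as --cel-files cel_list.txt.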

If the microarray does not have mismatch probes, you can use a surrogate mismatch based on probes with similar GC content by using the plier-gcbg analysis and specifying the background probes with the --bgp-file flag.

WARNING: apt-probeset-summarize will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name. It is also important to note that consistent with the Bioconductor implementation all RMA output has been log2 transformed.

For details of using `apt-probeset-summarize` with GC Correction (GCCN) and Space Transformation algorithms see vignette.

Options:

apt-probeset-summarize - A program for summarizing expression probe 
data from cel files. Can use either a cdf file or pgf/clf files for defining
probesets. Use the '--explain' flag for further documentation on a 
particular data transformation or summary value.

usage:
   apt-probeset-summarize -a rma-sketch -a plier-mm-sketch \
        -p chip.pgf -c chip.clf -o output-dir *.cel

options:
 Common Options (not used by all programs)
   -h, --help                           Display program options and extra
                          documentation about possible analyses. See
                          --explain for information about a specific
                          operation. [default 'false'] 
   -v, --verbose How verbose to be with status messages 0 -
                          quiet, 1 - usual messages, 2 - more
                          messages. [default '1'] 
     --console-off Turn off the default messages to the 
                          console but not logging or sockets. 
                          [default 'false'] 
     --use-socket Host and port to print messages over in
                          localhost:port format [default ''] 
     --version Display version information. [default
                          'false'] 
   -f, --force Disable various checks including chip 
                          types. Consider using --chip-type option
                          rather than --force. [default 'false'] 
     --throw-exception Throw an exception rather than calling
                          exit() on error. Useful for debugging. This
                          option is intended for command line use
                          only. If you are wrapping an Engine and 
                          want exceptions thrown, then you should 
                          call Err::setThrowStatus(true) to ensure
                          that all Err::errAbort() calls result in an
                          exception. [default 'false'] 
     --analysis-files-path Search path for analysis library files. 
                          Will override AFFX_ANALYSIS_FILES_PATH
                          environment variable. [default ''] 
     --xml-file Input parameters in XML format (Will
                          override command line settings). [default
                          ''] 
     --temp-dir Directory for temporary files when working
                          off disk. Using network mounted drives is
                          not advised. When not set, the output 
                          folder will be used. The default is 
                          typically the output directory or the
                          current working directory. [default ''] 
   -o, --out-dir Directory for output files. Defaults to
                          current working directory. [default '.'] 
     --log-file The name of the log file. Generally 
                          defaults to the program name in the out-dir
                          folder. [default ''] 
 Engine Options (Not used on command line)
     --command-line The command line executed. [default ''] 
     --exec-guid The GUID for the process. [default ''] 
     --program-name The name of the program [default ''] 
     --program-company The company providing the program [default
                          ''] 
     --program-version The version of the program [default ''] 
     --program-cvs-id The CVS version of the program [default ''] 
     --version-to-report The version to report in the output files.
                          [default ''] 
     --free-mem-at-start How much physical memory was available when
                          the engine run started. [default '0'] 
     --meta-data-info Meta data in key=value pair that will be
                          output in headers. [default ''] 
 Input Options
     --cel-files Text file specifying cel files to process,
                          one per line with the first line being
                          'cel_files'. [default ''] 
   -d, --cdf-file File defining probe sets. Use either
                          --cdf-file, --spf-file, or --pgf-file and
                          --clf-file. Automatically sets --names.
                          [default ''] 
     --spf-file File defining probe sets in spf (simple
                          probe format) which is like a text cdf 
                          file. [default ''] 
   -p, --pgf-file File defining probe sets. [default ''] 
   -c, --clf-file File defining x,y <-> probe id conversion.
                          Required when using PGF file. [default ''] 
   -b, --bgp-file File defining probes to be used for GC
                          background. [default ''] 
   -s, --probeset-ids File specifying probe sets to summarize.
                          [default ''] 
   -m, --meta-probesets File containing meta probeset definitions.
                          File must contain a probeset_id column and 
                          a probeset_list column. [default ''] 
     --probe-class-file File containing probe_id (1-based) of 
                          probes and a 'class' designation. Used to
                          compute mean probe intensity by class for
                          report file. [default ''] 
     --qc-probesets File with probeset_id(name) and group_name
                          columns specifying subsets of probesets to
                          compute qc stats for. [default ''] 
     --chip-type Chip types to check library and CEL files
                          against. Can be specified multiple times.
                          The first one is propagated as the chip 
                          type in the output files. Warning, use of
                          this option will override the usual check
                          between chip types found in the library
                          files and cel files. You should use this
                          option instead of --force when possible.
                          [default ''] 
 Output Options
     --use-pgf-names Use the probeset_names instead of
                          probeset_id column in the PGF file for
                          output. [default 'false'] 
     --cc-chp-output Output results in directory called 'cc-chp'
                          under out-dir. This makes one AGCC
                          Expression CHP file per cel file analyzed.
                          [default 'false'] 
     --xda-chp-output Output resulting calls in directory called
                          'chp' under out-dir. This makes one GCOS 
                          XDA CHP file per cel file analyzed. 
                          [default 'false'] 
     --cc-md-chp-output Output resulting calls in directory called
                          'cc-md-chp' under out-dir. This makes one
                          AGCC Multi Data CHP file per cel file
                          analyzed. [default 'false'] 
     --cc-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --xda-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --cc-md-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --subsample-report Output subsamples of the data intensities,
                          summaries and residuals for error checking
                          downstream. [default 'false'] 
     --report-file Over-ride the default report file name.
                          [default ''] 
 Analysis Options
   -a, --analysis String representing analysis pathway
                          desired. For example:
                          'quant-norm,pm-gcbg,plier'. Prepackaged
                          analysis such as 'plier-gcbg-sketch',
                          'plier-gcbg', 'plier-mm-sketch', 
                          'plier-mm', 'rma-sketch', and 'rma' can be
                          specified. Multiple analysis allowed at 
                          same time. When using quantile
                          normalization, you may need to use the
                          sketch option to avoid running out of
                          memory. [default ''] 
     --summaries Output expression summaries in text table
                          format. [default 'true'] 
     --feat-effects Output feature effects when available. By
                          convention med-polish feature effects have
                          total probeset median added to them, see 
                          RMA module for details [default 'false'] 
     --use-feat-eff File defining a plier feature effect for
                          each probe. Note that precomputed effects
                          should only be used for an appropriately
                          similar analysis (i.e. feature effects for
                          pm-only may be different than for pm-mm).
                          Currently a probe is expected to be in only
                          a single probeset. This option does not 
                          work for IterPlier or SEA. [default ''] 
     --feat-details Output probe by chip specific details 
                          (often residuals) when available. [default
                          'false'] 
     --target-sketch File specifying a target distribution to 
                          use for quantile normalization. [default 
                          ''] 
     --write-sketch Write the quantile normalization
                          distribution (or sketch) to a file for 
                          reuse with target-sketch option. WARNING: 
                          If more than one -a option generates a
                          target sketch file, it is not deterministic
                          which file will be retained by the OS if 
                          the target sketch files have the same name.
                          [default 'false'] 
     --reference-profile Reference profile [default ''] 
     --write-profile write reference profile. [default 'false'] 
     --set-analysis-name Explicitly set the analysis name. This
                          affects output file names (ie prefix) and
                          various meta info. [default ''] 
   -x, --precision How many digits of precision to use after
                          decimal. [default '5'] 
 Misc Options
     --explain Explain a particular operation (i.e.
                          --explain rma-bg). [default ''] 
 Advanced Options
     --kill-list Do not use the PM probes specified in file
                          for computing results. [experimental]
                          [default ''] 
 Execution Control Options
     --use-disk Store CEL intensities to be analyzed on
                          disk. [default 'true'] 
     --disk-cache Size of intensity memory cache in millions
                          of intensities (when --use-disk=true).
                          [default '50'] 
     --store-duplicate-probes Store intensities for probes appearing in
                          multiple probesets in memory (Prevents page
                          thrashing. Is a bad idea for Axiom. Turned
                          on automatically when using meta-probesets)
                          [default 'false'] 
 A5 output options
     --a5-global-file Filename for the A5 global output file.
                          [Experimental] [default ''] 
     --a5-global-file-no-replace Append or create rather than replace.
                          [Experimental] [default 'false'] 
     --a5-group Group name where to put results in the A5
                          output files. Defaults to '/'.
                          [Experimental] [default ''] 
     --a5-summaries Output the summary values from the
                          quantification method for each allele in A5
                          format. [Experimental] [default 'false'] 
     --a5-summaries-use-global Use the global A5 file for summaries.
                          [Experimental] [default 'false'] 
     --a5-feature-effects Output feature effects in A5 format.
                          [Experimental] [default 'false'] 
     --a5-feature-effects-use-global Use the global A5 file for feature
                          effects.[Experimental] [default 'false'] 
     --a5-feature-details Output feature level residuals in A5 
                          format. [Experimental] [default 'false'] 
     --a5-feature-details-use-global Use the global A5 file for residuals.
                          [Experimental] [default 'false'] 
     --a5-sketch Output normalization sketch in A5 format.
                          --write-sketch option will override this
                          option. [Experimental] [default 'false'] 
     --a5-sketch-use-global Put the sketch in the global A5 output 
                          file. [Experimental] [default 'false'] 
 A5 input options
     --a5-global-input-file Filename for the group in the global input
                          file.[Experimental] [default ''] 
     --a5-input-group Group name for input. Defaults to 
                          --a5-group or if that is not set, then '/'.
                          [Experimental] [default ''] 
     --a5-sketch-input-global Read the sketch from the global A5 input
                          file. [Experimental] [default 'false'] 
     --a5-sketch-input-file Read the sketch from an A5 input file.
                          [Experimental] [default ''] 
     --a5-sketch-input-group Group name to read the sketch from. 
                          Defaults to --a5-input-group. 
                          [Experimental] [default ''] 
     --a5-sketch-input-name The name of the data section. Defaults to
                          'target-sketch'. [Experimental] [default 
                          ''] 
     --a5-feature-effects-input-global Read the feature effects from the global
                          A5 input file. [Experimental] [default 'false'] 
     --a5-feature-effects-input-file Read the feature effects from an A5
                          input file. [Experimental] [default ''] 
     --a5-feature-effects-input-group Group name to read the feature effects 
                          from. Defaults to --a5-input-group.
                          [Experimental] [default ''] 
     --a5-feature-effects-input-name The name of the data section. Defaults to
                          XXX.feature-response where XXX is the
                          analysis name and quant method. IE
                          'brlmm-p.plier'. [Experimental] [default 
                          ''] 
 Engine Options (Not used on command line)
     --cels Cel files to process. [default ''] 
     --result-files CHP file names to output. Must be paired
                          with cels. [default ''] 
     --annotation-file NetAffx Annotation database file. [default
                          ''] 
     --time-start The time the engine run was started 
                          [default ''] 
     --time-end The time the engine run ended [default ''] 
     --time-run-minutes The run time in minutes. [default ''] 
     --analysis-guid The GUID for the analysis run. [default ''] 

Standard Methods:
 'dabg'                 pm-only,dabg
 'gc-sst-rma-sketch'    gc-correction,scale-intensities,rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish
 'plier-gcbg'           quant-norm.sketch=0.bioc=false,pm-gcbg,plier
 'plier-gcbg-sketch'    quant-norm.sketch=-1.bioc=false,pm-gcbg,plier
 'plier-mm'             quant-norm.sketch=0.bioc=false,pm-mm,plier
 'plier-mm-sketch'      quant-norm.sketch=-1.bioc=false,pm-mm,plier
 'rma'                  rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish
 'rma-gc-scale'         gc-correction,scale-intensities,rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish
 'rma-sketch'           rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish
 'rma-sketch-gc-scale'  gc-correction,scale-intensities,rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish

Data transformations:
   rma-bg               Performs an RMA style background adjustment 
                        as described in Irizarry et al 2003. 
   quant-norm           Class for doing quantile normalization. Can 
                        do sketch and full quantile (just set sketch
                        to chip size or zero) and supports
                        bioconductor compatibility. 
   artifact-reduction   Class for artifact reduction. 
   med-norm             Class for doing median normalization. Adjust
                        intensities such that all chips have the same
                        median (or average). 
   adapter-type-norm    Class for doing adapter type normalization.
                        Adjust intensities by adapter type. 
   gc-bg                Subtract background based on median intensity
                        of probes with similar GC content. 
   gc-correction        Correct feature intensity for variations in
                        gc_count. 
   scale-intensities    Scale cel intensities. 
   intensity-reporter   Class for dumping intensity values to a file. 
   no-trans             Placeholder chipstream that does no
                        transformation 

Pm Intensity Adjustments:
   pm-only   No adjustment. Just uses unmodified PM intensity values. 
   pm-mm     Use mismatch probe as adjustment for perfect match. Has
             strength of being unbiased, but often the mismatch probe
             binds the match target. 
   pm-gcbg   Do an adjustment based on the median intensity of probes
             with similar GC content. 
   pm-sum    Add intensity of PM probe for other allele to PM probes. 

Quantification Methods:
   plier        The PLIER (Probe Logarithmic Error Intensity 
                Estimate) method produces an improved signal by
                accounting for experimentally observed patterns in
                feature behavior and handling error
                appropriately at low and high signal values. This
                version of PLIER differs from the previous version by
                the addition of a SafetyZero, NumericalTolerance, and
                FixPrecomputed. These options are intended to improve
                the stability of PLIER results when using precomputed
                feature response values. To get the older PLIER
                behavior set SafetyZero to 0.0, NumericalTolerance to
                0.0, and FixPrecomputed to false. 
   sea          The SEA (Simplified Expression Analysis) method
                provides a simple signal estimate, using the
                initialization algorithm from the PLIER (Probe
                Logarithmic Error Intensity Estimate) method and
                omitting the PLIER parameter fitting. SEA is useful
                for single chip signal estimation. The version of
                PLIER used by SEA differs from the previous version 
                by the addition of a SafetyZero, NumericalTolerance,
                and FixPrecomputed. These options are intended to
                improve the stability of PLIER results when using
                precomputed feature response values. To get the older
                PLIER behavior set SafetyZero to 0.0,
                NumericalTolerance to 0.0, and FixPrecomputed to
                false. 
   iter-plier   Do probe set quantification estimate by iteratively
                calling PLIER with the probes that best correlate 
                with signal estimate. The version of PLIER used by
                IterPLIER differs from the previous version by the
                addition of a SafetyZero, NumericalTolerance, and
                FixPrecomputed. These options are intended to improve
                the stability of PLIER results when using precomputed
                feature response values. To get the older PLIER
                behavior set SafetyZero to 0.0, NumericalTolerance to
                0.0, and FixPrecomputed to false. 
   med-polish   Performs a median polish to estimate target and probe
                effects. Resulting summaries are in log2 space by
                default. Used in summary step of RMA as described in
                Irizarry et al 2003. 
   dabg         Calculates the p-value that the intensities in a
                probeset could have been observed by chance in a
                background distribution. Used as a substitute for
                standard absent/present calls when mismatch probes 
                are not available. 
   avgdiff      Calculates the average measurement for a probeset
                using the MAS 4 average difference algorithm, namely
                the average difference between the pm and mm probe
                signal. 
   median       Use the median of probes for a particular chip as the
                summary. 

Analysis Streams:
   expr           Does expression summarization on probesets. 
   pca-select     Determines PCA for probes and picks probes that are
                  near the principal component as the probes to use
                  for downstream analysis. 
   spect-select   Picks probes that are similar to each other based 
                  on spectral cluster and normalized cut. 

Example Usages:

Do an RMA analysis and a Plier PM-MM analysis at the same time for a set of chips.

apt-probeset-summarize -a rma -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do an RMA-sketch analysis and Plier using PM-MM analysis at the same time for a set of chips. Here RMA-sketch means use a subset of the chip to closely approximate the full quantile normalization while using much less memory. See Normalization for more details.

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do an RMA-sketch analysis and a Plier PM-MM analysis at the same time for a set of chips, restricted to a subset of probesets. Restricting the analysis to a subset can significantly lower memory usage.

apt-probeset-summarize -s probesetList.txt -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do an RMA-sketch analysis and a Plier analysis using surrogate mismatch based on GC background probes at the same time for a set of chips, summarizing the meta probesets that represent the RefSeq transcripts.

apt-probeset-summarize -a plier-gcbg-sketch -a rma-sketch -p HuEx-1_0-st-v2.pgf \
    -c HuEx-1_0-st-v2.clf -b backgroundProbes.bgp -m map.refseq.txt -o results-dir *.CEL

Replicate the results of RMA from bioconductor.

apt-probeset-summarize -a rma -d chip.cdf -o output-dir *.cel

Split up an analysis batch to run on a server. With the great diversity of Unix shells and cluster job servers there are many possible ways to split up jobs to run on a compute cluster. The example below uses the bash shell and the Portable Batch System job queuing system via the qsub command.

# Make a directory for our output files.
mkdir -p output
# Run on a single cel file to get the full list of probesets.
apt-probeset-summarize -a rma -d HG-U133_Plus_2.cdf -o names data/heart-rep1.cel
# Cut out the list of probeset ids from the first column.
cut -f 1 names/rma.summary.txt | grep -E '_[sa]t' > probesets.txt
# About 55,000 probesets: split them into 10 jobs with 5,500 probesets each.
wc -l probesets.txt
mkdir -p lists
split -d -l 5500 probesets.txt lists/probes_
# Prepend the required header 'probeset_id' to each file.
for list in `find lists -name "probes_*" ! -name "*.sub"`; do
   echo "probeset_id" > $list.sub;
   cat $list >> $list.sub;
done;

# Submit a job for each file.
cwd=`pwd`
for list in `find lists -name "probes_*.sub"`; do
   base=`basename $list`
   echo "Doing $base"
   echo "apt-probeset-summarize -s $cwd/$list -a rma-sketch -d $cwd/HG-U133_Plus_2.cdf -o $cwd/output/$base-dir $cwd/data/*.cel" | qsub
done;
# Once all jobs have finished, grab everything but the header info from all the files.
cat `find ./output -name "rma-sketch.summary.txt" | sort` | grep -v -E '#|probeset_id' > allResults.txt
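The final gathering step above drops every header row, so allResults.txt has no column names. As an alternative, here is a minimal Python sketch that merges the per-job summary tables while keeping the 'probeset_id' header row once. It assumes, as the shell commands above imply, that comment lines start with '#' and that the column header row starts with 'probeset_id'; merge_summaries and merge_summary_files are hypothetical helper names.

```python
def merge_summaries(tables):
    """Merge summary tables given as iterables of text lines.

    Lines starting with '#' are dropped and the 'probeset_id' header row
    is kept only the first time it is seen.
    """
    merged = []
    header_seen = False
    for lines in tables:
        for line in lines:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            if line.startswith("probeset_id"):
                if not header_seen:
                    merged.append(line)
                    header_seen = True
                continue
            merged.append(line)
    return merged


def merge_summary_files(paths):
    """Read each summary file in the given order and merge the contents."""
    tables = []
    for path in paths:
        with open(path) as handle:
            tables.append(handle.readlines())
    return merge_summaries(tables)
```

Because every job is run on the same cel files, the column order is expected to be identical across the per-job files, which is what makes a plain row-wise concatenation reasonable here.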

Concept Manual:

A Word about Program Options vs Analysis Parameters:

The results generated by apt-probeset-summarize can be affected by both the program level options (like --target-sketch, --meta-probesets, --bgp-file) and the analysis parameters used to create an analysis specification as supplied via the --analysis option (like quant-norm.usepm=true.bioc=true,pm-only,med-polish). The analysis level parameters are aimed at advanced users; others are encouraged to stick with the "Standard Methods" described above in the Options section.

The ability to specify parameters to an analysis via the analysis specification provides a powerful mechanism to construct custom analyses, but can be confusing at first as the output from most programs is controlled only by the program level options. In spirit the ability to specify analysis level parameters is analogous to providing a custom regular expression to the Unix program 'grep'. The available parameters for a particular step of an analysis can be seen using the --explain option (e.g. apt-probeset-summarize --explain quant-norm). See below for details on the comma and period separated format.

Some Concepts:

An analysis of Affymetrix microarrays usually starts with some combination of background subtraction, normalization, and summarization of all the probes from a particular probeset. There are a variety of different methods to perform each one of these steps. apt-probeset-summarize aims to provide a flexible way to specify an analysis and compute them efficiently. Key concepts for understanding what apt-probeset-summarize does and how it does it are:

Custom Analysis Specification:

As discussed above, an analysis is specified via a comma separated list of the analysis modules making up the analysis. Every analysis path must terminate with a PM modification module (e.g. pm-only or pm-mm) and a probeset summarization step (e.g. median polish for RMA). Parameters for a module are appended to the module name as period separated key=value pairs. For example, in the analysis specification 'quant-norm.sketch=0.usepm=true.bioc=true' the 'sketch=0' parameter indicates that the sketch size should be all the data, 'usepm=true' indicates that only the perfect match probes should be used for normalization, and 'bioc=true' means that the Bioconductor method of resolving ties in the quantile normalization should be used.

As an example, for an RMA style analysis with a linear normalization instead of the more aggressive quantile normalization, the analysis specification would look like: 'rma-bg,med-norm,pm-only,med-polish'
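To make the comma and period separated format concrete, the following Python sketch splits a specification string into its modules and their parameters. It is an illustration only, not APT's own parser. The function name parse_analysis_spec is hypothetical, and the handling of decimal values (a token without '=' is re-joined to the previous value) is an assumption of this sketch.

```python
def parse_analysis_spec(spec):
    """Split 'module.key=value.key=value,module,...' into (module, params) pairs."""
    steps = []
    for step in spec.split(","):
        name, *tokens = step.strip().split(".")
        params = {}
        last_key = None
        for token in tokens:
            if "=" in token:
                last_key, value = token.split("=", 1)
                params[last_key] = value
            elif last_key is not None:
                # Assumption: a token without '=' is the fractional part of a
                # decimal value such as 0.0 that was split on its period.
                params[last_key] += "." + token
            else:
                raise ValueError("unexpected token %r in step %r" % (token, step))
        steps.append((name, params))
    return steps


for module, params in parse_analysis_spec(
    "rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish"
):
    print(module, params)
```

All values are kept as strings; interpreting 'true' or '-1' is left to whatever consumes the parsed steps.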

The types of data transformations, PM intensity adjustments, and quantification methods supported by apt-probeset-summarize can be seen by calling apt-probeset-summarize with the --help flag. The parameters available for a particular method can be seen by calling apt-probeset-summarize with the --explain flag (e.g. --explain quant-norm).

Normalization:

Normalization is one of those topics where one can get 10 different opinions from 10 different people. The focus of this section is not to tell you which normalization method to use, but rather to alert you to some of the potential pitfalls in implementation.

Quantile normalization makes the entire distribution of data from different chips the same. One of the first steps of doing a quantile normalization is to sort the data from the chip. This leads to an implementation issue of what to do when ties occur (which often happens). The Bioconductor normalize.quantiles() function from the affy package resolves ties by setting all the ties to have the value of the data point at the middle of the run of ties. The limma package has a function called normalizeQuantiles() that by default arbitrarily breaks ties by ignoring the fact that they are occurring. If the ties=TRUE flag is set, normalizeQuantiles() will do the mathematically "correct" thing and use the average value of all of the ties as the final value to use, but incurs a significant slowdown in run time. By default the apt-probeset-summarize implementation uses the mathematically correct method of taking the average of all the tied values, but has been optimized to not incur a significant performance cost. For compatibility with the RMA method in Bioconductor the quant-norm module in apt-probeset-summarize has an optional parameter 'bioc' which can be set to true to get the normalize.quantiles() version of quantile normalization.
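The tie averaging rule can be illustrated with a small pure Python sketch. It is not the optimized APT implementation. The function name quantile_normalize is hypothetical, and the sketch assumes every chip has the same number of intensities.

```python
from statistics import mean


def quantile_normalize(chips):
    """Full quantile normalization; tied values share the average target value.

    chips is a list of equal-length lists of intensities, one list per chip.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Target distribution: the mean across chips at each rank.
    target = [mean(sc[rank] for sc in sorted_chips) for rank in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        start = 0
        while start < n:
            end = start
            # Extend the run while the next value in sorted order is tied.
            while end + 1 < n and chip[order[end + 1]] == chip[order[start]]:
                end += 1
            tied_value = mean(target[start:end + 1])
            for rank in range(start, end + 1):
                out[order[rank]] = tied_value
            start = end + 1
        normalized.append(out)
    return normalized


print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 4.0]]))
# prints [[4.5, 1.5, 3.5], [4.0, 1.5, 4.0]]
```

In the second chip the two tied values of 4.0 occupy ranks 1 and 2, so both receive the average of the target values at those ranks (3.5 and 4.5).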

A full quantile normalization is generally memory intensive as the typical implementation loads all of the data for all of the chips into memory. It is possible to use a subsample, or sketch, of the data to approximate the full quantile normalization. As we expect the data to be continuous we use linear interpolation when a data point falls in between the samples in the sketch. In practice, the sketch approximation is very close to the full quantile as long as the sketch is reasonably dense (defaults to 1% of probes or 50,000 data points, whichever is larger). Thus, using a sketch normalization means that you are no longer limited by the number of chips that you wish to analyze, but rather by the number of probesets that you wish to analyze at once.
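The idea of a sketch with linear interpolation can be shown in a few lines of Python. This is a rough illustration under stated assumptions, not the APT algorithm: the sketch here is simply a set of evenly spaced order statistics from each chip, values outside the sketch range are clamped to its ends, and the names make_sketch, interpolate and sketch_normalize are hypothetical.

```python
from bisect import bisect_right


def make_sketch(values, size):
    """Pick `size` evenly spaced order statistics (2 <= size <= len(values))."""
    if not 2 <= size <= len(values):
        raise ValueError("size must be between 2 and the number of values")
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(k * last / (size - 1))] for k in range(size)]


def interpolate(x, xs, ys):
    """Piecewise linear map from the sorted points xs onto ys, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, x)  # first sketch point strictly greater than x
    lo = hi - 1
    frac = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


def sketch_normalize(chips, size):
    """Approximate quantile normalization using one sketch per chip."""
    sketches = [make_sketch(chip, size) for chip in chips]
    # Target sketch: the mean across chips at each sketch position.
    target = [sum(column) / len(column) for column in zip(*sketches)]
    return [
        [interpolate(value, sketch, target) for value in chip]
        for chip, sketch in zip(chips, sketches)
    ]
```

Only the sketches need to be held together in memory to build the target, which is why the memory cost no longer grows with the full data of every chip.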

Linear normalization doesn't have as many implementation pitfalls as quantile normalization, but the determination of the target value to adjust the arrays to and the summary metric for an array can vary from method to method. Sometimes the summary metric used is the average and other times the median is used. The target value to adjust the chips to is sometimes generated by designating one chip as the 'reference chip' and adjusting all the other chips linearly to have the same mean (or median) as the reference chip. Another approach is to calculate a summary value from all of the chips (i.e. a virtual average chip). By default apt-probeset-summarize will normalize all the chips to the median value of the medians from each chip. One gotcha with any linear normalization is to make sure that you are normalizing to the signal of the array and not the background. This is especially true for arrays with a large amount of speculative content such as tiling arrays and the exon array.
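The default described above (normalize to the median of the per-chip medians) can be sketched as follows. This is a minimal illustration, not APT's code: it assumes that the linear adjustment is a single multiplicative scale factor per chip, and median_normalize is a hypothetical function name.

```python
from statistics import median


def median_normalize(chips):
    """Scale each chip so its median equals the median of all chip medians."""
    chip_medians = [median(chip) for chip in chips]
    if any(m == 0 for m in chip_medians):
        raise ValueError("cannot scale a chip whose median intensity is zero")
    target = median(chip_medians)
    return [
        [value * target / m for value in chip]
        for chip, m in zip(chips, chip_medians)
    ]


print(median_normalize([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [4.0, 8.0, 12.0]]))
# prints [[2.0, 4.0, 6.0], [2.0, 4.0, 6.0], [2.0, 4.0, 6.0]]
```

The three chips have medians 2, 4 and 8, so the target is 4 and the scale factors are 2, 1 and 0.5.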

When Problems Occur and Bugs Arise.

Should a problem arise that isn't addressed by the FAQ or this documentation please post a question at the APT devnet forum. Note that you'll need to register (at no cost) to use the Affymetrix devnet forums.

Please describe your issue in as much detail as possible. It is also very helpful to post the apt-probeset-summarize.log file (found in the output directory) and the command line you're using.

Frequently Asked Questions

Q. What do I do when I don't have enough memory to process all the data?

A. You can use a sketch quantile normalization rather than full quantile normalization. For example, using 'rma-sketch' (via the --analysis option) instead of 'rma' will significantly reduce the memory footprint while having a minimal impact on the results.

Q. I get slightly different RMA values when using the PGF file rather than the CDF file, why is that?

A. The CDF file excludes certain probesets that are contained in the PGF file. Including these probes slightly changes the RMA background subtraction parameters and the normalization, which produces slightly different results. The differences should not be significant.

Q. I get slightly different values with Plier when I use the precomputed feature effects than when I learn them on the same data set, why is that?

A. Plier is an iterative algorithm that stops searching for a better solution once a "good" fit has been found. When the feature effects are specified rather than learned the fit may be slightly different, but in the view of plier it is just as "good" as the original fit.

Q. How do I get probe level (rather than probeset level) p-values from dabg?

A. Use the --feature-details option. For most methods --feature-details outputs the residuals, but for dabg it will output the probe level p-values.

Q. How do I get more precision in the output than the 3 decimal places provided?

A. Use the --precision option to request more decimal places in the output. This will result in larger text files.

Q. Can I get more precision especially for dabg as I need to set a low threshold to address multiple testing concerns?

A. Try using the 'neglog10' parameter with dabg which will report -1 * log_10(p-value). See 'apt-probeset-summarize --explain dabg' for details, but an example analysis specification would be: '--analysis pm-only,dabg.neglog10=true'
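The reason the transformation helps can be seen with a few lines of Python. This only illustrates the arithmetic of -1 * log_10(p-value); the helper name neglog10 is borrowed from the parameter name and is not part of any APT API.

```python
import math


def neglog10(p):
    """Return -1 * log_10(p), the quantity reported when dabg.neglog10=true."""
    return -math.log10(p)


p = 1e-7
# At 3 decimal places a very small p-value is indistinguishable from zero...
print(f"{p:.3f}")            # prints 0.000
# ...but its negative log10 is still informative at the same precision.
print(f"{neglog10(p):.3f}")  # prints 7.000
```

A p-value cutoff of 0.001 corresponds to requiring a reported value greater than 3.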

Q. The output using RMA looks odd. Why are the values so small (in the 0-16 range)? Plier and Mas5 give much larger values.

A. As with the original RMA implementation, the RMA output is log2 transformed. To get values on the untransformed scale, change the analysis specification from '--analysis rma', which is an alias for '--analysis rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish', to '--analysis rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish.expon=true' (note the expon=true at the end of the specification).

Q. I get an error, "Error: Didn't find required column: 'cel_files' in file: 'cel_file_groups.tsv'", when using apt-probeset-summarize with data previously analyzed using ExACT.

A. apt-probeset-summarize expects the list of cel file names (provided with the --cel-files option) to have a header, "cel_files". ExACT expected/generated a header value of "cel_file". To use your ExACT cel file listing files with APT you will need to change the header from "cel_file" to "cel_files" (that is, add the "s").
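Since the error message asks for a column named 'cel_files', the header of an ExACT listing file can be adjusted by hand in a text editor or with a few lines of Python. The sketch below is only an illustration: the function name fix_exact_header is hypothetical, and it assumes a tab-separated file (as the .tsv name in the error message suggests) whose first line is the header.

```python
def fix_exact_header(text):
    """Rename a 'cel_file' header column to 'cel_files' in tab-separated text.

    Only the first line (assumed to be the header) is changed; every other
    line is returned untouched.
    """
    lines = text.splitlines(keepends=True)
    if lines:
        header = lines[0].rstrip("\r\n")
        ending = lines[0][len(header):]  # preserve the original line ending
        cols = ["cel_files" if c == "cel_file" else c
                for c in header.split("\t")]
        lines[0] = "\t".join(cols) + ending
    return "".join(lines)


# Example: read an ExACT listing, fix the header, and write a new file.
# with open("cel_file_groups.tsv") as f:
#     fixed = fix_exact_header(f.read())
# with open("cel_file_groups_apt.tsv", "w") as f:
#     f.write(fixed)
```

The output file name cel_file_groups_apt.tsv in the commented example is arbitrary; writing to a new file keeps the original ExACT listing intact.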

Q. I get slightly different answers when I normalize with exact-normalize.pl and then generate summary estimates vs. normalizing directly with apt-probeset-summarize - what is going on?

A. When exact-normalize.pl writes the normalized values to cel files, the floating point values are truncated for storage reasons. If you use XDA as the output format and specify 'lowprecision=true' to the apt-probeset-summarize normalization modules (e.g. 'rma-bg,quant-norm.lowprecision=true,pm-gcbg,plier') you should get the same values.

Q. Where can I get more information on the pca-select feature?

A. Currently there is a talk from the Affymetrix Low Level workshop: "Dynamically Selecting Probes for Gene-Level Expression Estimates Using a Principal Components Based Approach".

Q. How is the mask section in the CEL file used?

A. It is not. The contents of this section of the CEL file are ignored.

Q. How can I find out more information about the analysis string:

    rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish

A. Start with the included manual for apt-probeset-summarize. Usage information is also provided if you run apt-probeset-summarize without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:

    apt-probeset-summarize --explain quant-norm

Q. Wild cards do not work on Windows. For example:

    apt-probeset-summarize ... *.CEL

A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on *NIX systems). The Windows shell does not do wild card expansion, so there is no wild card expansion for APT when run from the Windows shell. You may want to try a different Windows shell or perhaps bash via cygwin. Alternatively, use the --cel-files option to specify the CEL files for analysis.
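If Python is available, the list file for --cel-files can be generated instead of typed by hand. The following is a minimal sketch, assuming a single-column text file with the "cel_files" header described elsewhere in this FAQ; the function name write_cel_list and the file name cel_list.txt are illustrative choices.

```python
import glob
import os


def write_cel_list(cel_dir, list_path):
    """Write a --cel-files style list of every .cel/.CEL file in cel_dir.

    Returns the sorted list of paths that were written.
    """
    cels = sorted(p for p in glob.glob(os.path.join(cel_dir, "*"))
                  if p.lower().endswith(".cel"))
    with open(list_path, "w", newline="\n") as out:
        out.write("cel_files\n")  # header expected by apt-probeset-summarize
        for path in cels:
            out.write(path + "\n")
    return cels


# Example: list the CEL files in the current directory.
# write_cel_list(".", "cel_list.txt")
```

The resulting file can then be passed on the command line, for example:

    apt-probeset-summarize -a rma-sketch -d chip.cdf -o output-dir --cel-files cel_list.txt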

Q. In Expression Console I can select the option "Scale to all probe sets" and specify a target (e.g. 200). How can I specify this in apt-probeset-summarize?

Q. Are APT CHP files compatible with EC? Are they identical to what EC generates?

A. See the CHP File Differences between EC, GC, and APT vignette for more information.

Q. Does apt-probeset-summarize generate a quality assessment report like EC?

A. Yes. There will be a report.txt output file (e.g. rma-sketch.report.txt) which includes many of the same quality assessment metrics.

Q. What happens to non-pm probes when pulled through the rma-bg chipstream?

A. The background parameterization is based only on PM probes. If a method requests an intensity value for a non-PM probe, then the intensity of that non-PM probe will be background adjusted just as if it were a PM probe. The following is an example of an analysis stream where this would occur:

    rma-bg,quant-norm,pm-mm,plier

Of course, combining the global background correction from RMA with PLIER and a probe-specific correction using the MM may give poor results.

Q. What is a probe_id?

A. See the FAQ item on probe IDs for more info.