MANUAL: apt-probeset-summarize (apt-1.10.0)

Introduction

apt-probeset-summarize is a program for background subtraction, normalization, and summarization of probe sets from Affymetrix expression microarrays. It implements analysis algorithms such as RMA, Plier, MAS5 detection, MAS5 signal (without signal level normalization; see the MAS5 items in the FAQ), and DABG (detected above background).

The main features of apt-probeset-summarize not common in other implementations are:

Quick Start

Most users will just want to generate summaries using RMA and/or Plier for each probeset on the microarray. We provide both 'rma' and 'rma-sketch' where 'rma-sketch' will closely approximate a full quantile normalization using a much smaller amount of memory.

On unix systems a command to run both the rma-sketch and plier-mm-sketch analyses at the same time with the default parameters looks like:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -d chip.cdf -o output-dir *.cel

The command above uses a CDF file. Alternatively, a PGF and CLF file pair can be specified:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

As the windows command prompt does not natively support wild card expansion, the preferred method is to supply a text file list via the --cel-files option (see the --cel-files entry under Options below for details of the file format). On windows a command using the default parameters looks like:

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -d chip.cdf -o output-dir --cel-files cel_list.txt

Here -a specifies an analysis to run and -o specifies the directory in which to put the output files. You can specify the probesets on a chip with either a CDF file via the -d flag or a PGF/CLF file pair via the -p and -c flags.
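The list file passed to --cel-files can be produced with a few shell commands wherever a bash-like shell is available. The following is a minimal sketch that assumes the format given under --cel-files in the Options section (a first line of 'cel_files' followed by one file per line). The temporary directory and the CEL file names are hypothetical placeholders created only for the demonstration.

```shell
# Hypothetical working directory with two empty placeholder CEL files.
work_dir=$(mktemp -d)
touch "$work_dir/heart-rep1.cel" "$work_dir/heart-rep2.cel"

# Write the required header line, then one CEL file path per line.
cel_list="$work_dir/cel_list.txt"
{
    echo "cel_files"
    for cel in "$work_dir"/*.cel; do
        echo "$cel"
    done
} > "$cel_list"

cat "$cel_list"
```

The resulting file would then be given as --cel-files cel_list.txt in place of the *.cel wild card.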

If the microarray does not have mismatch probes you can use a surrogate mismatch based on probes with similar GC content by using the plier-gcbg analysis and specifying the background probes with the --bgp-file flag.

WARNING: apt-probeset-summarize will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name. It is also important to note that consistent with the Bioconductor implementation all RMA output has been log2 transformed.

Options:

apt-probeset-summarize - A program for summarizing expression probe 
data from cel files. Can use either a cdf file or pgf/clf files for defining
probesets. Use the '--explain' flag for further documentation on a
particular data transformation or summary value.

usage:
   apt-probeset-summarize -a rma-sketch -a plier-mm-sketch \
        -p chip.pgf -c chip.clf -o output-dir *.cel

options:
 Basic Info and Control Options
   -h, --help             Display program options and extra
                          documentation about possible analyses. See
                          --explain for information about a specific
                          operation.
     --explain            Explain a particular operation (e.g.
                          --explain rma-bg).
   -v, --verbose          How verbose to be with status messages 0 -
                          quiet, 1 - usual messages, 2 - more
                          messages.
     --version            Display version information.
   -f, --force            Disable various checks including chip
                          types. Consider using --chip-type option
                          rather than --force.
 Input Options
     --cel-files          Text file specifying cel files to process,
                          one per line with the first line being
                          'cel_files'.
   -d, --cdf-file         File defining probe sets. Use either
                          --cdf-file, --spf-file, or --pgf-file and
                          --clf-file. Automatically sets --names.
     --spf-file           File defining probe sets in spf (simple
                          probe format) which is like a text cdf
                          file.
   -p, --pgf-file         File defining probe sets.
   -c, --clf-file         File defining x,y <-> probe id conversion.
                          Required when using PGF file.
   -b, --bgp-file         File defining probes to be used for GC
                          background.
   -s, --probeset-ids     File specifying probe sets to summarize.
   -m, --meta-probesets   File containing meta probeset definitions.
                          File must contain a probeset_id column and
                          a probeset_list column.
     --qc-probesets       File with probeset_id(name) and group_name
                          columns specifying subsets of probesets to
                          compute qc stats for.
     --chip-type          Chip types to check library and CEL files
                          against. Can be specified multiple times.
                          The first one is propagated as the chip
                          type in the output files. Warning, use of
                          this option will override the usual check
                          between chip types found in the library
                          files and cel files. You should use this
                          option instead of --force when possible.
 Output Options
   -o, --out-dir          Directory to write result files into.
     --cc-chp-output      Output results in directory called 'cc-chp'
                          under out-dir. This makes one AGCC
                          Expression CHP file per cel file analyzed.
                          [default: false]
     --xda-chp-output     Output resulting calls in directory called
                          'chp' under out-dir. This makes one GCOS
                          XDA CHP file per cel file analyzed.
                          [default: false]
     --cc-md-chp-output   Output resulting calls in directory called
                          'cc-md-chp' under out-dir. This makes one
                          AGCC Multi Data CHP file per cel file
                          analyzed. [default: false]
     --subsample-report   Output subsamples of the data intensities,
                          summaries and residuals for error checking
                          downstream.
   -n, --names            Output names rather than ids.
 Analysis Options
   -a, --analysis         String representing analysis pathway
                          desired. For example:
                          'quant-norm,pm-gcbg,plier'. Prepackaged
                          analyses such as 'plier-gcbg-sketch',
                          'plier-gcbg', 'plier-mm-sketch',
                          'plier-mm', 'rma-sketch', and 'rma' can be
                          specified. Multiple analyses allowed at
                          same time. When using quantile
                          normalization, you may need to use the
                          sketch option to avoid running out of
                          memory.
     --feature-effects    Output feature effects when available.
     --use-feat-effect    File defining a plier feature effect for
                          each probe. Note that precomputed effects
                          should only be used for an appropriately
                          similar analysis (e.g. feature effects for
                          pm-only may be different than for pm-mm).
                          Currently a probe is expected to be in only
                          a single probeset. This option does not
                          work for IterPlier or SEA.
     --feature-details    Output probe by chip specific details
                          (often residuals) when available.
     --target-sketch      File specifying a target distribution to
                          use for quantile normalization.
     --write-sketch       Write the quantile normalization
                          distribution (or sketch) to a file for
                          reuse with target-sketch option. WARNING:
                          If more than one -a option generates a
                          target sketch file, it is not deterministic
                          which file will be retained by the OS if
                          the target sketch files have the same name.
     --set-analysis-name  Explicitly set the analysis name. This
                          affects output file names (i.e. prefix) and
                          various meta info.
   -x, --precision        How many digits of precision to use after
                          decimal.
 Advanced Options
     --kill-list          Do not use the PM probes specified in file
                          for computing results. [experimental]
 Execution Control Options
     --mem-usage          How much memory (RAM) to use for this job
                          in megabytes. Only relevant when
                          --use-disk=false. [default: 0]
     --block-size         How many probesets to process at once,
                          useful when memory is limited (0 for all).
                          Only relevant when --use-disk=false.
                          [default: 0]
     --use-disk           Use disk based representation to avoid
                          excessive RAM use. [default: true]
     --disk-dir           Directory for temporary files when working
                          off disk. Using network mounted drives is
                          not advised. [default: out-dir folder]
     --disk-cache         Size of memory cache when working off disk
                          in megabytes. [default: 50]

Standard Methods:
 'dabg'               pm-only,dabg
 'plier-gcbg'         quant-norm.sketch=0.bioc=false,pm-gcbg,plier
 'plier-gcbg-sketch'  quant-norm.sketch=-1.bioc=false,pm-gcbg,plier
 'plier-mm'           quant-norm.sketch=0.bioc=false,pm-mm,plier
 'plier-mm-sketch'    quant-norm.sketch=-1.bioc=false,pm-mm,plier
 'rma'                rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish
 'rma-sketch'         rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish

Data transformations:
   rma-bg              Performs an RMA style background adjustment as
                       described in Irizarry et al 2003. 
   quant-norm          Class for doing quantile normalization. Can do
                       sketch and full quantile (just set sketch to
                       chip size or zero) and supports bioconductor
                       compatibility. 
   med-norm            Class for doing median normalization. Adjust
                       intensities such that all chips have the same
                       median (or average). 
   adapter-type-norm   Class for doing adapter type normalization.
                       Adjust intensities by adapter type. 
   mas5-bg             Performs a MAS 5 background adjustment as
                       described in Liu et al, Bioinformatics (2002). 
   gc-bg               Subtract background based on median intensity
                       of probes with similar GC content. 

Pm Intensity Adjustments:
   pm-only   No adjustment. Just uses unmodified PM intensity values. 
   pm-mm     Use mismatch probe as adjustment for perfect match. Has
             strength of being unbiased, but often the mismatch probe
             binds the match target. 
   pm-gcbg   Do an adjustment based on the median intensity of probes
             with similar GC content. 
   pm-sum    Add intensity of PM probe for other allele to PM probes.

Quantification Methods:
   plier         The PLIER (Probe Logarithmic Error Intensity
                 Estimate) method produces an improved signal by
                 accounting for experimentally observed patterns in
                 feature behavior and handling error appropriately
                 at low and high signal values. This version of
                 PLIER differs from the previous version by the
                 addition of SafetyZero, NumericalTolerance, and
                 FixPrecomputed. These options are intended to
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the
                 older PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false.
   sea           The SEA (Simplified Expression Analysis) method
                 provides a simple signal estimate, using the
                 initialization algorithm from the PLIER (Probe
                 Logarithmic Error Intensity Estimate) method and
                 omitting the PLIER parameter fitting. SEA is useful
                 for single chip signal estimation. The version of
                 PLIER used by SEA differs from the previous version
                 by the addition of SafetyZero, NumericalTolerance,
                 and FixPrecomputed. These options are intended to
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the
                 older PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false.
   iter-plier    Do probe set quantification estimate by iteratively
                 calling PLIER with the probes that best correlate
                 with signal estimate. The version of PLIER used by
                 IterPLIER differs from the previous version by the
                 addition of SafetyZero, NumericalTolerance, and
                 FixPrecomputed. These options are intended to
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the
                 older PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false.
   med-polish    Performs a median polish to estimate target and 
                 probe effects. Resulting summaries are in log2 space
                 by default. Used in summary step of RMA as described
                 in Irizarry et al 2003. 
   dabg          Calculates the p-value that the intensities in a
                 probeset could have been observed by chance in a
                 background distribution. Used as a substitute for
                 standard absent/present calls when mismatch probes
                 are not available. 
   mas5-detect   Calculates the p-value for detection of an expressed
                 gene using the MAS 5.0 algorithm. This is a
                 rank-based algorithm, using discrimination scores,
                 described in Liu et al., Bioinformatics (2002)
                 18:1593 and the Statistical Algorithms Reference
                 Guide. 
   mas5-signal   Calculates the average measurement for a probeset
                 using the MAS 5.0 algorithm. This is based on a
                 robust estimator, Tukey's biweight, described in
                 Hubbell et al., Bioinformatics (2002) 18:1585 and 
                 the Statistical Algorithms Reference Guide. WARNING:
                 The implementation in APT does not allow for signal
                 level normalization across the chip. See the FAQ 
                 item in the manual. 
   avgdiff       Calculates the average measurement for a probeset
                 using the MAS 4 average difference algorithm, namely
                 the average difference between the pm and mm probe
                 signal. 
   median        Use the median of probes for a particular chip as 
                 the summary. 

Analysis Streams:
   expr           Does expression summarization on probesets. 
   pca-select     Determines PCA for probes and picks probes that are
                  near the principal component as the probes to use
                  for downstream analysis. 
   spect-select   Picks probes that are similar to each other based 
                  on spectral cluster and normalized cut. 

Example Usages:

Do an RMA and Plier using PM-MM analysis at the same time for a set of chips.

apt-probeset-summarize -a rma -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do an RMA-sketch analysis and Plier using PM-MM analysis at the same time for a set of chips. Here RMA-sketch means use a subset of the chip to closely approximate the full quantile normalization while using much less memory. See Normalization and Memory below for more details.

apt-probeset-summarize -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do an RMA-sketch and Plier using PM-MM analysis at the same time for a set of chips, restricted to a subset of probesets. This can significantly lower memory usage.

apt-probeset-summarize -s probesetList.txt -a rma-sketch -a plier-mm-sketch -p chip.pgf -c chip.clf -o output-dir *.cel

Do a Plier analysis using a surrogate mismatch based on GC background probes and an RMA-sketch analysis at the same time for a set of chips, on the meta probesets that represent the RefSeq transcripts.

apt-probeset-summarize -a plier-gcbg-sketch -a rma-sketch -p HuEx-1_0-st-v2.pgf \
    -c HuEx-1_0-st-v2.clf -b backgroundProbes.bgp -m map.refseq.txt -o results-dir *.CEL

Replicate the results of RMA from bioconductor.

apt-probeset-summarize -a rma -d chip.cdf -o output-dir *.cel

Split up an analysis batch to run on a server. With the great diversity of unix shells and cluster job servers there are many possible ways to split up jobs to run on a compute cluster. The example below uses the bash shell and the Portable Batch System job queuing system via the qsub command.

# Make a directory for our output files.
mkdir output
# Run on a single cel file to get full list of probesets
apt-probeset-summarize -a rma -d HG-U133_Plus_2.cdf -o names data/heart-rep1.cel
# Cut out list of probesets (first column, probeset names only)
cut -f 1 names/rma.summary.txt | grep -E '_[sa]t' > probesets.txt
# About 55,000 probesets, so do 10 jobs with 5,500 probesets each
wc -l probesets.txt
mkdir lists
split -d -l 5500 probesets.txt lists/probes_
# Prepend the required header 'probeset_id' to each file.
# Skip any *.sub files left over from an earlier run.
for list in `find lists -name "probes_*" ! -name "*.sub"`; do
   echo "probeset_id" > $list.sub
   cat $list >> $list.sub
done

# Submit a job for each file.
for list in `find lists -name "probes_*.sub"`; do
   cwd=`pwd`
   base=`basename $list`
   echo "Doing $base"
   # The *.cel wild card is expanded by the shell that runs the job.
   echo "apt-probeset-summarize -s $cwd/$list -a rma-sketch -d $cwd/HG-U133_Plus_2.cdf -o $cwd/output/$base-dir $cwd/data/*.cel" | qsub
done

# Put all the results in one file.
# First grab the header information from one file
grep -E '#|probeset_id' ./output/probes_00.sub-dir/rma-sketch.summary.txt > allResults.txt
# Then grab everything but the header info from all the files.
cat `find ./output -name "rma-sketch.summary.txt" | sort` | grep -v -E '#|probeset_id'  >> allResults.txt

Concept Manual:

A Word about Program Options vs Analysis Parameters:

The results generated by apt-probeset-summarize can be affected by both the program level options (like --target-sketch, --meta-probesets, --bgp-file) and the analysis parameters used to create an analysis specification as supplied via the --analysis option (like quant-norm.usepm=true.bioc=true,pm-only,med-polish). The analysis level parameters are aimed at advanced users; others are encouraged to stick with the "Standard Methods" described above in the Options section.

The ability to specify parameters to an analysis via the analysis specification provides a powerful mechanism to construct custom analyses, but can be confusing at first as the output from most programs is controlled only by the program level options. In spirit the ability to specify analysis level parameters is analogous to providing a custom regular expression to the unix program 'grep'. The available parameters for a particular step of an analysis can be seen using the --explain option (e.g. apt-probeset-summarize --explain quant-norm). See below for details on the comma and period separated format.

Some Concepts:

An analysis of Affymetrix microarrays usually starts with some combination of background subtraction, normalization, and summarization of all the probes from a particular probeset. There are a variety of different methods to perform each one of these steps. apt-probeset-summarize aims to provide a flexible way to specify an analysis and to compute it efficiently. Key concepts for understanding what apt-probeset-summarize does and how it does it are:

Custom Analysis Specification:

As discussed above, an analysis is specified via a comma separated list of the analysis modules that make up the analysis. Every analysis path must terminate with a PM modification module (e.g. pm-only or pm-mm) and a probeset summarization step (e.g. median polish for RMA, or Plier). Parameters for a particular module are appended to the module name, separated by periods, using a 'key=value' syntax. In the case of RMA, the module specification 'quant-norm.sketch=0.usepm=true.bioc=true' reads as follows: 'sketch=0' indicates that the sketch size should be all the data, 'usepm=true' indicates that only the perfect match probes should be used for normalization, and 'bioc=true' means that the Bioconductor method of resolving ties in the quantile normalization should be used.

As an example, RMA usually uses a quantile normalization, and the 'rma' alias expands to the analysis specification 'rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish'. To do an RMA style analysis with a linear normalization instead of the more aggressive quantile normalization, the analysis specification would be 'rma-bg,med-norm,pm-only,med-polish'.
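The comma and period structure can be made explicit by assembling a specification from its parts. The following is a small sketch rather than a required workflow: the shell variable names are hypothetical, and the commands are only printed for review, not executed.

```shell
# Parameters for a module are appended to its name with periods.
norm="quant-norm.sketch=0.usepm=true.bioc=true"

# Modules are joined with commas, in the order they are applied.
rma_spec="rma-bg,${norm},pm-only,med-polish"

# Variant from the text: swap the quantile step for median normalization.
linear_spec="rma-bg,med-norm,pm-only,med-polish"

# Print the commands for review instead of running them.
echo "apt-probeset-summarize -a '$rma_spec' -d chip.cdf -o output-dir *.cel"
echo "apt-probeset-summarize -a '$linear_spec' -d chip.cdf -o output-dir *.cel"
```

Quoting the specification on the command line keeps the shell from interpreting any character in it.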

The types of data transformations, PM intensity adjustments, and quantification methods supported by apt-probeset-summarize can be seen by calling apt-probeset-summarize with the --help flag. The parameters available for a particular method can be seen by calling apt-probeset-summarize with the --explain flag (e.g. --explain quant-norm).

Normalization:

Normalization is one of those topics where one can get 10 different opinions from 10 different people. The focus of this section is not to tell you which normalization method to use, but rather to alert you to some of the potential pitfalls in implementation.

Quantile normalization makes the entire distribution of data from different chips the same. One of the first steps of doing a quantile normalization is to sort the data from the chip. This leads to the implementation issue of what to do when ties occur (which happens often). The Bioconductor normalize.quantiles() function from the affy package resolves ties by setting all the ties to have the value of the data point at the middle of the run of ties. The limma package has a function called normalizeQuantiles() that by default breaks ties arbitrarily by ignoring the fact that they are occurring. If the ties=TRUE flag is set, normalizeQuantiles() will do the mathematically "correct" thing and use the average value of all of the ties as the final value, but it incurs a significant slowdown in run time. By default the apt-probeset-summarize implementation uses the mathematically correct method of taking the average of all the tied values, but has been optimized to not incur a significant performance cost. For compatibility with the RMA method in Bioconductor, the quant-norm module in apt-probeset-summarize has an optional parameter 'bioc' which can be set to true to get the normalize.quantiles() version of quantile normalization.
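The effect of averaging over ties can be seen on a toy example. The sketch below is an illustration of the idea only, not the APT implementation. It assumes one chip whose intensities are already sorted and paired with the target distribution value at each rank, and all numbers are hypothetical.

```shell
# Columns: sorted intensity for one chip, target value at that rank.
# All numbers are hypothetical. Ranks 2-4 are tied at intensity 20.
ranked='10 12
20 18
20 22
20 29
35 40'

normalized=$(printf '%s\n' "$ranked" | awk '
    # Emit the mean target value once for every member of the finished run.
    function flush_run(    i) {
        for (i = 1; i <= n; i++) print run_value, sum / n
        n = 0; sum = 0
    }
    NR > 1 && $1 != run_value { flush_run() }
    { run_value = $1; n++; sum += $2 }
    END { flush_run() }')
echo "$normalized"
```

Here the three tied probes all receive (18 + 22 + 29) / 3 = 23. Taking the value at the middle of the run of ties, as described above for normalize.quantiles(), would assign 22 instead.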

A full quantile normalization is generally memory intensive as the typical implementation loads all of the data for all of the chips into memory. It is possible to use a subsample, or sketch, of the data to approximate the full quantile normalization. As we expect the data to be continuous, we use linear interpolation when a data point falls in between the samples in the sketch. In practice, the sketch approximation is very close to the full quantile normalization as long as the sketch is reasonably dense (the default is 1% of probes or 50,000 data points, whichever is larger). Thus, using a sketch normalization means that you are no longer limited by the number of chips that you wish to analyze, but rather by the number of probesets that you wish to analyze at once.

Linear normalization doesn't have as many implementation pitfalls as quantile normalization, but the determination of the target value to adjust the arrays to and the summary metric for an array can vary from method to method. Sometimes the summary metric used is the average and other times the median is used. The target value to adjust the chips to is sometimes generated by designating one chip as the 'reference chip' and adjusting all the other chips linearly to have the same mean (or median) as the reference chip. Another approach is to calculate a summary value from all of the chips (i.e. a virtual average chip). By default apt-probeset-summarize will normalize all the chips to the median value of the medians from each chip. One gotcha with any linear normalization is to make sure that you are normalizing to the signal of the array and not the background. This is especially true for arrays with a large amount of speculative content, such as tiling arrays and the exon array.
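The median-of-medians target can be worked through on a few numbers. The sketch below is illustrative only: the intensities are hypothetical, the median helper is defined just for this example, and the use of a single multiplicative scale factor per chip is an assumption about the form of the linear adjustment rather than a description of the med-norm module.

```shell
# Median of a list of numbers (helper written for this sketch).
median() {
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }
        END { print (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2 }'
}

# Hypothetical probe intensities for three chips.
chip_a="100 200 300"
chip_b="150 300 450"
chip_c="50 100 150"

med_a=$(median $chip_a)
med_b=$(median $chip_b)
med_c=$(median $chip_c)

# Target value: the median of the per-chip medians.
target=$(median "$med_a" "$med_b" "$med_c")

# Multiplicative scale factor per chip (assumed form of the adjustment).
scales=$(awk -v t="$target" -v a="$med_a" -v b="$med_b" -v c="$med_c" \
    'BEGIN { printf "%.3f %.3f %.3f", t / a, t / b, t / c }')
echo "target=$target scales=$scales"
```

With these numbers the chip medians are 200, 300 and 100, so the target is 200 and the second chip is scaled down while the third is scaled up.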

Memory (RAM) Issues:

Note that some analyses use more memory than others. In particular any full quantile normalization (e.g. RMA) can demand large amounts of RAM (specifically 4*numProbes*numChips bytes). By modifying the analysis it is often possible to use much less memory; for example, using a sketch quantile normalization can greatly reduce the amount of memory necessary.
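The estimate of 4*numProbes*numChips bytes is easy to evaluate before starting a run. A minimal sketch follows, in which the probe and chip counts are hypothetical and should be replaced with the values for your own array type and batch.

```shell
# Hypothetical array size and batch size; substitute your own values.
num_probes=1000000
num_chips=100

# 4 bytes per probe intensity per chip, as stated above.
bytes=$((4 * num_probes * num_chips))
megabytes=$((bytes / 1024 / 1024))
echo "full quantile normalization: about $megabytes MB ($bytes bytes)"
```

For these example values the estimate is 400,000,000 bytes, or roughly 381 MB.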

The rest of this section is only relevant if you specify --use-disk=false. Starting with release 1.10.0, apt-probeset-summarize default behavior is to use temporary files and a single iteration. As such, you should not see memory issues when using the default settings. If you decide to force the in-memory mode (--use-disk=false) then read on...

By default (when in-memory mode is forced using --use-disk=false) apt-probeset-summarize will attempt to guess the amount of available memory on the system and divide the job into small enough chunks (blocks) of probesets to fit in that amount of memory. This guess is based on available resources at the beginning of the program and is easily fooled by either running multiple instances of apt-probeset-summarize or running other applications on the computer at the same time. Advanced users are encouraged to use the --block-size option to fine tune the amount of memory used by apt-probeset-summarize.

When Problems Occur and Bugs Arise.

Should a problem arise that isn't addressed by the FAQ or this documentation please post a question at the APT devnet forum. Note that you'll need to register (at no cost) to use the Affymetrix devnet forums.

Please be as descriptive as possible of your issue. It is also very helpful to post the apt-probeset-summarize.log file (found in the output directory) and the command line you're using.

Frequently Asked Questions

Q. What do I do when I don't have enough memory to process all the data?

A. You can use a sketch quantile normalization rather than full quantile normalization. For example, using 'rma-sketch' (via --analysis option) instead of 'rma' will significantly reduce memory footprint while having a minimal impact on the results.

Q. I get slightly different values from RMA when using the PGF file rather than the CDF file. Why is that?

A. The CDF file excludes certain probesets that are contained in the PGF file. Including the probes from these probesets slightly changes the RMA background subtraction parameters and the normalization, which produces slightly different results. The differences should not be significant.

Q. I get slightly different values with Plier when I use the precomputed feature effects than when I learn them on the same data set. Why is that?

A. Plier is an iterative algorithm that stops searching for a better solution once a "good" fit has been found. When the feature effects are specified rather than learned the fit may be slightly different, but in the view of plier it is just as "good" as the original fit.

Q. How do I get probe level (rather than probeset level) p-values from dabg?

A. Use the --feature-details option. For most methods --feature-details outputs the residuals, but for dabg it will output the probe level p-values.

Q. How do I get more precision in the output than the 3 decimal places provided?

A. Use the --precision option to set the number of digits written after the decimal point. Note that higher precision will result in larger text files.

Q. Can I get more precision especially for dabg as I need to set a low threshold to address multiple testing concerns?

A. Try using the 'neglog10' parameter with dabg which will report -1 * log_10(p-value). See 'apt-probeset-summarize --explain dabg' for details, but an example analysis specification would be: '--analysis pm-only,dabg.neglog10=true'
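When neglog10 output is used, a p-value cutoff has to be converted to the same scale before filtering. The following is a minimal sketch of that conversion; the threshold of 0.001 is a hypothetical example value, not a recommendation.

```shell
# Hypothetical p-value cutoff chosen for the example.
p_threshold=0.001

# Convert to the -log10 scale reported by dabg.neglog10=true.
neglog10_threshold=$(awk -v p="$p_threshold" \
    'BEGIN { printf "%.2f", -log(p) / log(10) }')
echo "p < $p_threshold corresponds to neglog10 > $neglog10_threshold"
```

On this scale larger values mean smaller p-values, so the direction of the comparison is reversed.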

Q. The output using RMA looks odd, Why are the values so small (in 0-16 range)? Plier and Mas5 give much larger values.

A. As with the original RMA implementation the values are on the log2 scale. You can use the 'expon' parameter to med-polish to exponentiate them back to the scale used by Plier and Mas5. See 'apt-probeset-summarize --explain med-polish' for details. An example usage would be to change from the default RMA analysis specification of '--analysis rma', which is an alias for '--analysis rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish', to '--analysis rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish.expon=true' (note the small expon=true at the end of the specification).
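If a log2 summary file has already been written, the values can also be exponentiated afterwards. The sketch below is one possible approach and not part of APT. It assumes a tab separated summary file with '#' comment lines and a header line starting with 'probeset_id' (the layout implied by the result-merging example earlier in this manual), and the file contents are hypothetical.

```shell
work_dir=$(mktemp -d)
summary="$work_dir/rma.summary.txt"

# Tiny hypothetical summary file: a comment line, a header, two probesets.
printf '#%%example=header\nprobeset_id\tchip1.cel\tchip2.cel\n' > "$summary"
printf 'ps_1\t10.000\t3.000\nps_2\t8.000\t0.000\n' >> "$summary"

# Pass comment and header lines through; replace each value x with 2^x.
linear=$(awk 'BEGIN { FS = OFS = "\t" }
    /^#/ || $1 == "probeset_id" { print; next }
    { for (i = 2; i <= NF; i++) $i = sprintf("%.3f", 2 ^ $i); print }' "$summary")
echo "$linear"
```

A log2 value of 10 becomes 1024 on the linear scale, which is the range users of Plier and Mas5 output will recognize.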

Q. I get an error, "Error: Didn't find required column: 'cel_files' in file: 'cel_file_groups.tsv'", when using apt-probeset-summarize with data previously analyzed using ExACT.

A. apt-probeset-summarize expects the list of cel file names (provided with the --cel-files option) to have a header, "cel_files". ExACT expected/generated a header value of "cel_file". To use your ExACT cel file listing files with APT you will need to change "cel_file" to "cel_files" in the header (that is, add the "s").
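The header can be changed in a text editor or with a short command. The following sketch shows one way to do it. The file contents are hypothetical, and the second column in the example listing is an assumption made for illustration, since the ExACT file layout is not described in this manual.

```shell
work_dir=$(mktemp -d)
exact_list="$work_dir/cel_file_groups.tsv"
apt_list="$work_dir/cel_files.txt"

# Hypothetical ExACT-style listing: header 'cel_file' plus an assumed column.
printf 'cel_file\tgroup\nheart-rep1.cel\theart\nheart-rep2.cel\theart\n' > "$exact_list"

# Rename only a header field that is exactly 'cel_file'; data lines pass through.
awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "cel_file") $i = "cel_files" }
    { print }' "$exact_list" > "$apt_list"

head -n 1 "$apt_list"
```

Writing to a new file leaves the original ExACT listing untouched.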

Q. I get slightly different answers when I normalize with exact-normalize.pl and then generate summary estimates vs. normalizing directly with apt-probeset-summarize - what is going on?

A. When exact-normalize.pl writes the normalized values to cel files the floating point values are truncated for storage reasons. If you use XDA as the output format and specify 'lowprecision=true' to the apt-probeset-summarize normalization modules (e.g. 'rma-bg,quant-norm.lowprecision=true,pm-gcbg,plier') you should get the same values.

Q. Where can I get more information on the pca-select feature?

A. Currently there is a talk on the method, given by Chuck Sugnet at the Affymetrix Low Level Analysis Workshop. See "Dynamically Selecting Probes for Gene-Level Expression Estimates Using a Principal Components Based Approach".

Q. How is the mask section in the CEL file used?

A. It is not. The contents of this section of the CEL file are ignored.

Q. How can I find out more information about the analysis string:

    rma-bg,quant-norm.sketch=0.usepm=true.bioc=true,pm-only,med-polish

A. Start with the included manual for apt-probeset-summarize. Usage information is also provided if you run apt-probeset-summarize without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:

    apt-probeset-summarize --explain quant-norm

Q. Wild cards do not work on windows. For example:

    apt-probeset-summarize ... *.CEL

A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on *NIX systems). The windows shell does not do wild card expansion, so there is no wild card expansion for APT when run from the windows shell. You may want to try a different windows shell or perhaps bash via cygwin. See the --cel-files option for an alternative way to specify CEL files for analysis.

Q. In expression console I can select the option "Scale to all probe sets" and specify a target (e.g. 200). How can I specify this in apt-probeset-summarize ?

A. The MAS5 implementation in APT does not support these options. Specifically, the MAS5 algorithm specifies a post-summarization normalization over all or a selected set of probeset signal estimates. This requires the software to have signal values for all probesets on the chip to compute the normalization factor. apt-probeset-summarize has been optimized for model based methods (e.g. RMA and PLIER) and as such is set up to process probesets in chunks over lots of chips at a single time. Thus the information to compute the scaling factor is not available at the time the signal estimates are computed.

So there is a significant implication here. The mas5-signal values reported are not scaled and thus not normalized across chips. (That is unless you are mixing in mas5-signal and mas5-bg methods with the median or quantile probe level normalization -- in which case the results would no longer be MAS5. The analysis string for mas5-signal values is "mas5-bg,pm-mm,mas5-signal".)

So probeset level signal normalization (in a downstream analysis package) is necessary.

Q. What is the analysis string for MAS5 detection and MAS5 signal?

A. Note the question above regarding MAS5 and EC vs APT implementations. With that in mind, the analysis strings are:

Q. I do not get the same MAS5 results in APT that I get out of the MAS5 software. Why?

A. More-or-less, a new implementation of the MAS5 methods was necessary to get this method to work in the APT architecture. In doing so we noticed that there are some numerical inconsistencies between platforms and between the old and new implementation. In general these differences are small and insignificant. In some cases, however the differences can be larger due to probes for which an imputed mismatch is used in one case and the real mismatch in another case due to slight numerical differences at the probe level. So you will not get strictly the same results as what the MAS5 software returns. Another note is that we have not done nearly as much testing of the MAS5 stuff in APT. Lastly, it should be noted that Expression Console (EC) uses the older MAS5 implementation and not the APT MAS5 implementation.

Q. Are APT CHP files compatible with EC? Are they identical to what EC generates?

A. See the CHP File Differences between EC, GC, and APT vignette for more information.

Q. Does apt-probeset-summarize generate a quality assessment report like EC?

A. Yes. There will be a report.txt output file (e.g. rma-sketch.report.txt) which includes many of the same quality assessment metrics.

Q. What happens to non-pm probes when pulled through the rma-bg chipstream?

A. The background parameterization is based only on PM probes. If methods request an intensity value for a non-PM probe, then the probe intensity of that non-PM probe will be background adjusted just as if it were a PM probe. The following is an example of an analysis stream where this would occur:

        rma-bg,quant-norm,pm-mm,plier
   
Of course combining the global background correction from RMA with PLIER and a probe specific correction using the MM may result in poor results.

Q. What is a probe_id?

A. See the FAQ item on probe IDs for more info.


Generated on Thu Jul 10 09:25:16 2008 for Affymetrix Power Tools by  doxygen 1.5.3