VIGNETTES: Use of APT to Analyze WT Based Exon and Gene Expression Arrays

Date: 2007-05-31

Introduction

You can use the APT program, apt-probeset-summarize, to compute gene and exon level signal estimates from Exon array data and gene level estimates for Gene array data.

Analysis Library Files

To use apt-probeset-summarize to analyze WT-based Gene and Exon expression arrays you first need to obtain the necessary library files:

Note that you can specify either an MPS (meta probeset) file or a PS (probeset list) file, but not both, in a single analysis.

Quick Start Exon Array

Step 1: Download the analysis library files

For Exon Arrays (e.g., the Human Exon 1.0 ST Array), all of the files needed are in the analysis library file package (zip archive file) that can be downloaded from the respective array support page. (For example, the Analysis zip file under the "Library Files" section of the Human Exon 1.0 ST Array - Support Materials page.)

Direct links for various exon array analysis library file packages:

NOTE: If you have Expression Console (EC) installed on a Windows machine, you can use EC to download the library files and then simply copy the files you need from the EC library file folder.

NOTE: Earlier versions (i.e., prior to June 2007) of the analysis library file zip archive did not include the MPS files.

Step 2: Use apt-probeset-summarize to compute GENE level estimates

Here is a basic example you might run from the bash *NIX shell:

    apt-probeset-summarize \
        -p HuEx-1_0-st-v2.r2.pgf \
        -c HuEx-1_0-st-v2.r2.clf \
        -b HuEx-1_0-st-v2.r2.antigenomic.bgp \
        --qc-probesets HuEx-1_0-st-v2.r2.qcc \
        -m HuEx-1_0-st-v2.r2.dt1.hg18.core.mps \
        -a rma-sketch \
        -o output-gene \
        *.CEL

This assumes that apt-probeset-summarize is in your PATH. One might wonder why a bgp file is specified when we are not using the pm-gcbg pm adjustor (rma-sketch uses pm-only). The reason is that APT will still use the bgp information to compute the bgrd mean quality metric in the report file.
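Before launching a long run, it can help to confirm that every library file named on the command line is actually present. The following is a minimal sketch of such a pre-flight check; the function name check_library_files is hypothetical and is not part of APT.

```shell
# Report which of the given library files are missing or unreadable.
# Prints one line per problem file to stderr and returns non-zero if any
# file is missing. (check_library_files is a hypothetical helper, not an
# APT command.)
check_library_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing library file: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the file names from the command above:
#   check_library_files \
#       HuEx-1_0-st-v2.r2.pgf \
#       HuEx-1_0-st-v2.r2.clf \
#       HuEx-1_0-st-v2.r2.antigenomic.bgp \
#       HuEx-1_0-st-v2.r2.qcc \
#       HuEx-1_0-st-v2.r2.dt1.hg18.core.mps
```

If the check returns non-zero, the names printed to stderr are the files to download or copy into the working folder before running apt-probeset-summarize.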

Note that under Windows you need to make a couple of changes compared to the *NIX command line. First, everything must be on the same line (no "\" continuations). Second, the wildcard "*.CEL" will not work: you need to either list every CEL file on the command line or use the --cel-files option. Here is an example using the latter:

    apt-probeset-summarize -p HuEx-1_0-st-v2.r2.pgf -c HuEx-1_0-st-v2.r2.clf -b HuEx-1_0-st-v2.r2.antigenomic.bgp --qc-probesets HuEx-1_0-st-v2.r2.qcc -m HuEx-1_0-st-v2.r2.dt1.hg18.core.mps -a rma-sketch -o output-gene --cel-files celfiles.txt

The file specified with the --cel-files option is a tab-separated text file with a header line. One column must be named "cel_files"; other columns, such as "rep" and "tissue" in this example, may also be present. For example:

cel_files	rep	tissue
heart1.CEL	1	heart
heart2.CEL	2	heart
heart3.CEL	3	heart
brain1.CEL	1	brain
brain2.CEL	2	brain
brain3.CEL	3	brain
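On *NIX, a list like this does not have to be typed by hand. The sketch below writes a minimal one-column file (just the "cel_files" header plus one path per line) for every *.CEL file in a folder. The function name make_cel_list is hypothetical, and the match is case-sensitive, so files ending in ".cel" would need a second pattern.

```shell
# Write a --cel-files list for every *.CEL file in a directory.
# Usage: make_cel_list <cel_dir> <output_file>
# (make_cel_list is a hypothetical helper, not part of APT.)
make_cel_list() {
    local cel_dir=$1 out=$2 f
    printf 'cel_files\n' > "$out"        # required header column
    for f in "$cel_dir"/*.CEL; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        printf '%s\n' "$f" >> "$out"
    done
}

# Example usage:
#   make_cel_list . celfiles.txt
```

Extra columns such as "rep" and "tissue" can then be added in a spreadsheet or text editor, provided the file stays tab-separated.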

The notable option that makes this a gene level analysis is the inclusion of a meta probeset file, HuEx-1_0-st-v2.r2.dt1.hg18.core.mps, with the "-m" option.

The analysis is specified with the "-a" option. This can be either a full analysis specification or the name of a pre-canned analysis specification. For example, the pre-canned "rma-sketch" analysis method could also be specified using the full analysis specification "rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish". See the Analysis Methods section for other analysis options.
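A full analysis specification is easier to read once it is broken at the commas. The sketch below does only that; it appears that each comma-separated piece names one analysis component, with any options for that component attached after periods as name=value pairs, but consult the apt-probeset-summarize Manual for the authoritative syntax. The function name split_analysis_spec is hypothetical.

```shell
# Print each comma-separated component of an analysis specification on
# its own line. (split_analysis_spec is a hypothetical helper; it only
# splits text and does not validate the specification against APT.)
split_analysis_spec() {
    printf '%s\n' "$1" | tr ',' '\n'
}

split_analysis_spec "rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish"
# prints:
#   rma-bg
#   quant-norm.sketch=-1.usepm=true.bioc=true
#   pm-only
#   med-polish
```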

Output files will be generated in the "output-gene" subfolder. WARNING: apt-probeset-summarize will overwrite previously generated results. With the example command above, the following output files will be created:

Here is an example of running PLIER with the GC-bin background correction under the bash *NIX shell:

    apt-probeset-summarize \
        -p HuEx-1_0-st-v2.r2.pgf \
        -c HuEx-1_0-st-v2.r2.clf \
        -b HuEx-1_0-st-v2.r2.antigenomic.bgp \
        --qc-probesets HuEx-1_0-st-v2.r2.qcc \
        -m HuEx-1_0-st-v2.r2.dt1.hg18.core.mps \
        -a plier-gcbg-sketch \
        -o output-gene \
        *.CEL

In this case the same files will be generated, but with a "plier-gcbg-sketch" prefix rather than "rma-sketch". Also note that the bgp file is required here because of the GC-bin background correction. In the RMA case above the bgp file was optional and was included only to ensure the production of the bgrd mean quality assessment metric.

Note that these two analyses could be performed in a single APT run by including both "-a" parameters. For example:

    apt-probeset-summarize \
        -p HuEx-1_0-st-v2.r2.pgf \
        -c HuEx-1_0-st-v2.r2.clf \
        -b HuEx-1_0-st-v2.r2.antigenomic.bgp \
        --qc-probesets HuEx-1_0-st-v2.r2.qcc \
        -m HuEx-1_0-st-v2.r2.dt1.hg18.core.mps \
        -a plier-gcbg-sketch \
        -a rma-sketch \
        -o output-gene \
        *.CEL
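Because apt-probeset-summarize overwrites previously generated results, repeated runs into the same folder can silently replace earlier output. One way to avoid this is to choose a folder name that does not exist yet. This is a minimal sketch; the function name fresh_output_dir and the "-1", "-2" suffix scheme are illustrative choices, not APT features.

```shell
# Echo an output folder name that does not exist yet: the base name
# itself if it is free, otherwise base-1, base-2, and so on.
# (fresh_output_dir is a hypothetical helper, not part of APT.)
fresh_output_dir() {
    local base=$1 candidate=$1 n=1
    while [ -e "$candidate" ]; do
        candidate="$base-$n"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Example usage:
#   out=$(fresh_output_dir output-gene)
#   ...then pass  -o "$out"  to apt-probeset-summarize.
```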

Step 3: Use apt-probeset-summarize to compute EXON level estimates

Here is a basic example you might run from the bash *NIX shell:

    apt-probeset-summarize \
        -p HuEx-1_0-st-v2.r2.pgf \
        -c HuEx-1_0-st-v2.r2.clf \
        -b HuEx-1_0-st-v2.r2.antigenomic.bgp \
        --qc-probesets HuEx-1_0-st-v2.r2.qcc \
        -s HuEx-1_0-st-v2.r2.dt1.hg18.core.ps \
        -a rma-sketch \
        -o output-exon \
        *.CEL

Apart from the output folder, the only change compared to the gene level example above is the use of a probeset list instead of a meta probeset file. (Writing to a separate folder such as output-exon keeps this run from overwriting the rma-sketch gene level results in output-gene.) By specifying the probeset list HuEx-1_0-st-v2.r2.dt1.hg18.core.ps with the "-s" option, we restrict the analysis to just the "core" supported exons rather than the whole chip. Note that this probeset list file includes controls, so they will be present in the output as well. One could omit the "-s" option, in which case all the probesets in the PGF file would be processed. This would also produce an exon level result file, including the various controls.
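Since the gene and exon level commands differ essentially in whether a meta probeset file ("-m") or a probeset list ("-s") is passed, the choice can be captured in a small helper. The sketch below only prints the command so it can be inspected before being run; the function name build_apt_command is hypothetical, and the library file names are the ones used in the examples above.

```shell
# Print (do not run) an apt-probeset-summarize command for a gene or
# exon level analysis of the Human Exon array examples above.
# Usage: build_apt_command <gene|exon> <mps-or-ps-file> <output-dir>
# (build_apt_command is a hypothetical helper, not part of APT.)
build_apt_command() {
    local level=$1 list_file=$2 out_dir=$3 list_opt
    case "$level" in
        gene) list_opt="-m" ;;   # meta probeset file
        exon) list_opt="-s" ;;   # probeset list file
        *) echo "level must be 'gene' or 'exon'" >&2; return 1 ;;
    esac
    printf '%s ' apt-probeset-summarize \
        -p HuEx-1_0-st-v2.r2.pgf \
        -c HuEx-1_0-st-v2.r2.clf \
        -b HuEx-1_0-st-v2.r2.antigenomic.bgp \
        --qc-probesets HuEx-1_0-st-v2.r2.qcc \
        "$list_opt" "$list_file" \
        -a rma-sketch \
        -o "$out_dir"
    printf '%s\n' '*.CEL'
}

# Example usage:
#   build_apt_command exon HuEx-1_0-st-v2.r2.dt1.hg18.core.ps output-exon
```

Taking a single level argument also mirrors the rule that an MPS file and a PS file cannot both be given in one analysis.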

Quick Start Gene Array

Step 1: Download the analysis library files

For Gene Arrays (e.g., the Human Gene 1.0 ST Array), all the needed files are in the analysis library file package (zip archive file) that can be downloaded from the respective array support page. (For example, the Analysis zip file under the "Library Files" section of the Human Gene 1.0 ST Array - Support Materials page.)

Direct links for various gene array analysis library file packages:

NOTE: If you have Expression Console (EC) installed on a Windows machine, you can use EC to download the library files and then simply copy the files you need from the EC library file folder.

Step 2: Use apt-probeset-summarize to compute gene level estimates

Here is a basic example you might run from the bash *NIX shell:

    apt-probeset-summarize \
        -p HuGene-1_0-st-v1.r3.pgf \
        -c HuGene-1_0-st-v1.r3.clf \
        -b HuGene-1_0-st-v1.r3.bgp \
        --qc-probesets HuGene-1_0-st-v1.r3.qcc \
        -m HuGene-1_0-st-v1.r3.mps \
        -a rma-sketch \
        -o output-gene \
        *.CEL

This assumes that apt-probeset-summarize is in your PATH.

Note that under Windows you need to make a couple of changes compared to the *NIX command line. First, everything must be on the same line (no "\" continuations). Second, the wildcard "*.CEL" will not work: you need to either list every CEL file on the command line or use the --cel-files option. Here is an example using the latter:

    apt-probeset-summarize -p HuGene-1_0-st-v1.r3.pgf -c HuGene-1_0-st-v1.r3.clf -b HuGene-1_0-st-v1.r3.bgp --qc-probesets HuGene-1_0-st-v1.r3.qcc -m HuGene-1_0-st-v1.r3.mps -a rma-sketch -o output-gene --cel-files celfiles.txt

The celfiles.txt file is a tab-separated file with a header line, which must include a "cel_files" column. Here is an example:

cel_files
heart1.CEL
heart2.CEL
heart3.CEL
brain1.CEL
brain2.CEL
brain3.CEL
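A malformed list file is easier to diagnose before the run than from an error part-way through. The sketch below checks the two properties described above: the header has a "cel_files" column, and every file named in that column exists. The function name validate_cel_list is hypothetical, and the check assumes relative paths are resolved against the current working folder.

```shell
# Check a --cel-files list: the tab-separated header must contain a
# "cel_files" column, and every file named in that column must exist.
# Usage: validate_cel_list <list_file>
# (validate_cel_list is a hypothetical helper, not part of APT.)
validate_cel_list() {
    local list=$1 col missing
    [ -r "$list" ] || { echo "cannot read $list" >&2; return 1; }
    # Find the 1-based index of the cel_files column in the header line.
    col=$(awk -F'\t' '
        NR == 1 {
            sub(/\r$/, "")               # tolerate Windows line endings
            for (i = 1; i <= NF; i++)
                if ($i == "cel_files") { print i; exit }
            exit
        }' "$list")
    if [ -z "$col" ]; then
        echo "no cel_files column in header of $list" >&2
        return 1
    fi
    # Collect listed files that do not exist.
    missing=$(awk -F'\t' -v c="$col" '
        NR > 1 { sub(/\r$/, ""); if ($c != "") print $c }' "$list" |
        while IFS= read -r f; do
            [ -e "$f" ] || printf '%s\n' "$f"
        done)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing" | sed 's/^/listed CEL file not found: /' >&2
        return 1
    fi
    return 0
}

# Example usage:
#   validate_cel_list celfiles.txt && echo "celfiles.txt looks usable"
```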

The notable option that makes this a gene level analysis is the inclusion of a meta probeset file, HuGene-1_0-st-v1.r3.mps, with the "-m" option. With the current gene array library files, omitting the meta probeset file will still result in a gene level analysis. This is in part because probes are already grouped by gene in the PGF file. There are two reasons to include the meta probeset file anyway. First, it excludes certain control probesets from the analysis (for example, we do not want to compute RMA estimates for the various GC background probe bins). Second, it propagates into the output files the information that a meta probeset file was used. There may be downstream analysis software that uses this information to distinguish a gene level analysis from an exon level analysis.

The analysis is specified with the "-a" option. This can be either a full analysis specification or the name of a pre-canned analysis specification. For example, the pre-canned "rma-sketch" analysis method could also be specified using the full analysis specification "rma-bg,quant-norm.sketch=-1.usepm=true.bioc=true,pm-only,med-polish". See the Analysis Methods section for other analysis options.

Output files will be generated in the "output-gene" subfolder. WARNING: apt-probeset-summarize will overwrite previously generated results. With the example command above, the following output files will be created:

Other Command Line Options

apt-probeset-summarize also has a number of other options. As you get more comfortable using APT, you should take some time to familiarize yourself with these options. Some of the options that you may want to use include:

See the apt-probeset-summarize Manual and the online help (run "apt-probeset-summarize -h") for more information.

Annotation Quality and Meta Probeset Alternatives

The use of more speculative content in the meta probeset files will pull in more genes, but it may also attenuate the signal values reported for all genes, depending on the analysis method. In general, newer analysis methods (e.g., IterPLIER or PCA Feature Selection) are recommended when dealing with more speculative content. See the Analysis Methods section for more information about alternative analysis methods.

Two key resources for more information on this topic:

Analysis Methods

A variety of analysis methods are implemented in apt-probeset-summarize and you can combine various analysis components to create new methods as well. More information is in the apt-probeset-summarize Manual.

With regard to gene level analysis, the following two sources contain important information:

Some of the methods you may want to consider:

Frequently Asked Questions

Q. The "full" meta-probeset file for the Human Exon 1.0 ST Array does not use all 1.4 million probesets in the probeset_list column. Why not?

A. Probesets that either do not map uniquely to the genome or are classified as bounded probesets are not included in the MPS file and thus are not used in the calculation of gene level signals. Bounded probesets are probesets hitting an exon that falls within a larger gene structure, but for which there is no annotation evidence that the exon is spliced to other parts of the gene structure. For example, there may be a sub-optimal Genscan single-exon prediction that falls within a known gene, but this single-exon prediction is not associated with the other parts of the gene structure by other spliced annotations. (See the Exon Probeset Annotations and Transcript Cluster Groupings whitepaper for more details.)

Q. A core exon analysis using PLIER in Expression Console produces signal estimates for 287,329 exon probeset IDs. However, if I look at the HuEx-1_0-st-probeset-annot.csv file or the HuEx-1_0-st-v2.r2.dt1.hg18.csv there are only 284,258 core exon probeset IDs in these two files. Why are there approximately 3,000 more probeset IDs in the exon core summary that are not included in the annotation files?

A. The library files for the exon array include the various control probesets in the probeset list files (*.ps) and the meta probeset files (*.mps). This ensures that the various controls are processed and that the QC report file is fully populated regardless of the analysis type. So the additional probesets are the various control probesets. The majority of these are the intron/exon controls (listed in QC report file as positive and negative controls) that are used to compute the positive/negative area under the curve (AUC) QC metric.
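To see exactly which probeset IDs account for such a difference, the two ID sets can be compared directly. The sketch below assumes the IDs have already been extracted into two plain text files with one ID per line; it does not parse APT output or annotation CSV files itself, and the function name ids_only_in_first is hypothetical.

```shell
# Print IDs present in the first list but absent from the second.
# Both inputs are plain text files with one probeset ID per line (an
# assumed, pre-extracted format).
# (ids_only_in_first is a hypothetical helper, not part of APT.)
ids_only_in_first() {
    local a b
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    sort -u "$1" > "$a"      # comm needs sorted input
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"       # keep lines unique to the first file
    rm -f "$a" "$b"
}

# Example usage (file names are placeholders):
#   ids_only_in_first summary_ids.txt annotation_ids.txt | wc -l
```

For the numbers quoted in the question, the count would be expected to be 287,329 - 284,258 = 3,071 if every annotated core ID also appears in the summary.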

Q. The number of results reported from a Gene Array is greater than the number of probesets listed in the CSV annotation file. Why?

A. There are a number of controls that are included in the MPS files to ensure that all of the control metrics can be calculated. Some of these controls are not listed in the CSV annotation file.

Q. Where can I get meta probeset files?

A. For the Exon and Gene Arrays, the default meta probeset files are provided in the analysis library file zip archive.

For the Human, Mouse, and Rat Exon 1.0 ST Arrays, additional meta probeset files can be obtained from the respective array support page. You need to download one of the "Design Time Annotation Files". Specifically, one of the "Probeset" (not "Full") annotation files. For example, MoEx-1_0-st-v1 Annotations, Probeset, Mm6, CSV.

Note that prior to around June 2007 there were no meta probeset files in the Exon Array analysis library file zip archive.

Affymetrix Power Tools (APT) Release apt-1.10.1

Generated on Mon Nov 3 12:21:42 2008 for Affymetrix Power Tools by  doxygen 1.5.3