VIGNETTES: Use of APT to Analyze DMET Plus Arrays

Date:
2011-04-07

Introduction

Affymetrix Power Tools includes command line access to the computational engines used for the DMET genotyping and copy number analysis in the DMET Console application. This access provides the flexibility to incorporate the computational engine into custom bioinformatic pipelines, and it exposes certain options not currently available through DMET Console. This vignette outlines some examples of command line processing of DMET data using the different available modes of analysis.

You will require the DMET_Plus genotyping library files and the DMET Plus annotation and translation files. The library files include all supported versions of the input reference file for the DMET Plus array. In addition, you need version 1.12.0 or later of the Affymetrix Power Tools. The CHP files generated by the workflows presented here can be incorporated into DMET Console for further analysis and visualization.

The DMET_Plus genotyping library files are assumed to reside in a directory indicated by the path stored in $LIB_DIR. The DMET Plus annotation and translation files are assumed to reside in a directory indicated by the path stored in $ANNOT_DIR.
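
As a minimal sketch, the two variables could be set and the library directory checked before a run as follows. The directory locations and the helper name check_lib_files are assumptions for illustration and are not part of APT; the file names are the ones used by the genomic commands in this vignette.

```shell
#!/bin/sh
# Hypothetical locations; adjust to where the files were unpacked.
LIB_DIR="${LIB_DIR:-$HOME/dmet/library}"
ANNOT_DIR="${ANNOT_DIR:-$HOME/dmet/annotation}"

# check_lib_files DIR
# Prints one "missing: ..." line for each expected library file that is
# absent from DIR and returns non-zero if any file is missing.
check_lib_files() {
  dir=$1
  missing=0
  for f in \
    DMET_Plus.v1.cdf \
    DMET_Plus.v1.chrXprobes \
    DMET_Plus.v1.chrYprobes \
    DMET_Plus.v1.specialSNPs \
    DMET_Plus.v1.cn-region-models.txt \
    DMET_Plus.v1.cn-probeset-models.txt \
    DMET_Plus.v1.genomic.gt.ps \
    DMET_Plus.v1.cn-gt.ps
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Example use: check_lib_files "$LIB_DIR" || exit 1
```

The reference file is left out of the check on purpose, because the file to use depends on the sample type and the analysis mode.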

Most of the library files remain the same moving among DMET Console versions 1.0, 1.1, and 1.2; only the genomic input reference and plasmid input reference files are revised. The different reference files include the following:

File                            DNA Type  Fixed/Dynamic  Version
DMET_Plus.v1.genomic.ref.a5     Genomic   Fixed          1
DMET_Plus.v1.genomic.ref.r2.a5  Genomic   Dynamic        1
DMET_Plus.v1.genomic.ref.r3.a5  Genomic   Fixed          2
DMET_Plus.v1.genomic.ref.r4.a5  Genomic   Dynamic        2
DMET_Plus.v1.plasmid.ref.a5     Plasmid   Fixed          1
DMET_Plus.v1.plasmid.ref.r2.a5  Plasmid   Dynamic        1

The general advice is to use the newest versions of the reference files, either fixed boundary or dynamic boundary as desired. The original reference files are provided principally for compatibility with previous work.
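
The table can be encoded as a small lookup so that a pipeline always picks the newest reference file for a given DNA type and boundary mode. This is only a sketch: the function name pick_reference is hypothetical, and the mapping is taken directly from the table above.

```shell
# pick_reference DNA_TYPE BOUNDARY
# Prints the newest reference file name listed in the table for the given
# DNA type (genomic|plasmid) and boundary mode (fixed|dynamic).
pick_reference() {
  case "$1:$2" in
    genomic:fixed)   echo "DMET_Plus.v1.genomic.ref.r3.a5" ;;
    genomic:dynamic) echo "DMET_Plus.v1.genomic.ref.r4.a5" ;;
    plasmid:fixed)   echo "DMET_Plus.v1.plasmid.ref.a5" ;;
    plasmid:dynamic) echo "DMET_Plus.v1.plasmid.ref.r2.a5" ;;
    *)
      echo "usage: pick_reference genomic|plasmid fixed|dynamic" >&2
      return 1
      ;;
  esac
}

# Example use: REF="$LIB_DIR/$(pick_reference genomic dynamic)"
```

Note that the version 2 genomic reference files go together with the version 2 parameter settings, so the rest of the command line has to be chosen to match.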

Fixed boundary analysis

The following call to apt-dmet-genotype processes the CEL files in the same manner as the legacy "Fixed Genotype Boundaries" analysis configuration for DMET Console. Other DMET Plus input reference files will function properly for single-sample analysis, but they will not yield numerical results that match those generated by DMET Console. In fixed-boundary analysis, samples are analyzed one at a time, and the signals of each sample are compared against predefined clustering models to make a genotype call. There are separate clustering models for plasmid controls and genomic samples, so make sure to match the models to the sample type.

  apt-dmet-genotype \
    --sample-type                genomic \
    --out-dir                    output \
    --cdf-file                   $LIB_DIR/DMET_Plus.v1.cdf \
    --cel-files                  cel_files.txt \
    --run-cn-engine              true \
    --chrX-probes                $LIB_DIR/DMET_Plus.v1.chrXprobes \
    --chrY-probes                $LIB_DIR/DMET_Plus.v1.chrYprobes \
    --special-snps               $LIB_DIR/DMET_Plus.v1.specialSNPs \
    --region-model               $LIB_DIR/DMET_Plus.v1.cn-region-models.txt \
    --probeset-model             $LIB_DIR/DMET_Plus.v1.cn-probeset-models.txt \
    --reference-input            $LIB_DIR/DMET_Plus.v1.genomic.ref.a5 \
    --gt-analysis                quant-norm.sketch=50000,pm-only,brlmm-p-multi\
.CM=2.bins=100.mix=1.bic=2.lambda=0.0.HARD=3.SB=0.75.KX=0.3.KH=0.3.KXX=0.1\
.KAH=-0.1.KHB=-0.1.KYAH=-0.05.KYHB=-0.05.KYAB=-0.1.transform=MVA.AAM=2.8\
.BBM=-2.8.AAV=0.10.BBV=0.10.ABV=0.10.V=1.AAY=10.7.ABY=11.3.BBY=10.7\
.copyqc=0.00000.wobble=0.0.MS=0.1.copytype=-1.clustertype=2.CSepPen=0.5\
.ocean=0.0000001.cc-alleles=6.cc-type=UCHAR.cc-version=1.0 \
    --geno-call-thresh           0.1 \
    --cc-chp-output              true \
    --probeset-ids               $LIB_DIR/DMET_Plus.v1.genomic.gt.ps \
    --cn-region-gt-probeset-file $LIB_DIR/DMET_Plus.v1.cn-gt.ps \
    --chip-type                  DMET_Plus

  # Separate command: print the built-in explanation of the brlmm-p
  # algorithm and its options.
  apt-probeset-genotype --explain brlmm-p

Examine the documentation for apt-dmet-genotype for a detailed explanation of all available options and their meanings.

To process plasmids, set --sample-type to plasmid, --reference-input to DMET_Plus.v1.plasmid.ref.a5, and --probeset-ids to DMET_Plus.v1.plasmid.gt.ps.
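
The commands in this vignette read the list of CEL files from cel_files.txt. A minimal sketch for generating such a list from a directory is shown here; it assumes the usual APT list layout of a header line reading cel_files followed by one CEL file path per line, and the helper name make_cel_list is hypothetical.

```shell
# make_cel_list CEL_DIR OUT_FILE
# Writes a CEL file list: the header line "cel_files", then one path per
# line for every *.CEL file found in CEL_DIR (upper-case extension only).
make_cel_list() {
  cel_dir=$1
  out_file=$2
  {
    echo "cel_files"
    for cel in "$cel_dir"/*.CEL; do
      # Skip the unexpanded pattern when the directory has no CEL files.
      if [ -f "$cel" ]; then
        echo "$cel"
      fi
    done
  } > "$out_file"
}

# Example use: make_cel_list /path/to/cels cel_files.txt
```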

The following call to apt-dmet-genotype processes the CEL files in the same manner as the "Fixed Genotype Boundaries — version 2" analysis configuration for DMET Console.

  apt-dmet-genotype \
    --sample-type                genomic \
    --out-dir                    output \
    --cdf-file                   $LIB_DIR/DMET_Plus.v1.cdf \
    --cel-files                  cel_files.txt \
    --run-cn-engine              true \
    --chrX-probes                $LIB_DIR/DMET_Plus.v1.chrXprobes \
    --chrY-probes                $LIB_DIR/DMET_Plus.v1.chrYprobes \
    --special-snps               $LIB_DIR/DMET_Plus.v1.specialSNPs \
    --region-model               $LIB_DIR/DMET_Plus.v1.cn-region-models.txt \
    --probeset-model             $LIB_DIR/DMET_Plus.v1.cn-probeset-models.txt \
    --reference-input            $LIB_DIR/DMET_Plus.v1.genomic.ref.r3.a5 \
    --gt-analysis                quant-norm.sketch=50000,pm-only,brlmm-p-multi\
.CM=2.bins=100.mix=1.bic=2.lambda=0.0.HARD=3.SB=0.75.KX=0.3.KH=0.3.KXX=0.1\
.KAH=-0.1.KHB=-0.1.KYAH=-0.05.KYHB=-0.05.KYAB=-0.1.transform=MVA.AAM=2.8\
.BBM=-2.8.AAV=0.10.BBV=0.10.ABV=0.10.V=1.AAY=10.7.ABY=11.3.BBY=10.7\
.copyqc=0.00000.wobble=0.0.MS=0.001.copytype=-1.clustertype=2.CSepPen=0.5\
.ocean=0.000000000000001.cc-alleles=6.cc-type=UCHAR.cc-version=1.0 \
    --geno-call-thresh           0.001 \
    --cc-chp-output              true \
    --probeset-ids               $LIB_DIR/DMET_Plus.v1.genomic.gt.ps \
    --cn-region-gt-probeset-file $LIB_DIR/DMET_Plus.v1.cn-gt.ps \
    --chip-type                  DMET_Plus

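The three changes from the previous call are the reference input file (DMET_Plus.v1.genomic.ref.r3.a5 instead of DMET_Plus.v1.genomic.ref.a5), the call threshold (MS in the --gt-analysis string and --geno-call-thresh are both 0.001 instead of 0.1), and the ocean parameter (0.000000000000001 instead of 0.0000001).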

Dynamic boundary analysis

It is possible that either the target preparation or the experimental conditions for a set of samples may differ sufficiently from those used for the training data of DMET Plus that dynamic-boundary analysis (i.e. adaptation of the cluster boundaries to new locations suggested by the empirical data) may improve the results. In cases where it provides an improvement, the usual effect is to shift cluster boundaries and increase the call rate; the accuracy typically remains about the same.

It is most appropriate to cluster samples in batches that correspond to equivalent processing conditions, that is, samples that share the same systematic cluster shifts. Trying to adapt to the more random spread associated with different assay conditions tends to expand the cluster variance rather than shifting the cluster. For example, if you had two batches of samples each of which had a different degree and type of cluster shifting, it would be best to perform two clustering runs, one for each batch separately, rather than a single run on all the samples.
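
One way to organize batch-wise clustering is to keep a sample sheet and split it into one CEL file list per batch, then run apt-dmet-genotype once per list. The sketch below covers only the splitting step. The sample sheet layout (tab-delimited, CEL path then batch label) and the function name split_by_batch are assumptions for illustration, and batch labels are assumed to contain no spaces or slashes.

```shell
# split_by_batch SHEET OUT_DIR
# SHEET is a hypothetical tab-delimited file: CEL path, then batch label.
# Writes OUT_DIR/cel_files.<batch>.txt for each batch, each starting with
# the "cel_files" header line.
split_by_batch() {
  sheet=$1
  out_dir=$2
  mkdir -p "$out_dir"
  awk -F '\t' -v dir="$out_dir" '
    NF >= 2 {
      file = dir "/cel_files." $2 ".txt"
      if (!(file in seen)) {        # first sample seen for this batch
        print "cel_files" > file    # start the list with its header
        seen[file] = 1
      }
      print $1 > file               # awk keeps the file open and appends
    }
  ' "$sheet"
}

# Example use: split_by_batch samples.txt lists
```

Each resulting list would then be passed to its own apt-dmet-genotype run through --cel-files, with a separate --out-dir and --batch-name per batch.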

To reduce the possible loss of accuracy of dynamic-boundary analysis relative to fixed-boundary analysis, the reference file and some algorithm parameters differ from their fixed-boundary equivalents. The following call yields the same results as the legacy "Dynamic Genotype Boundaries" analysis configuration for DMET Console.

  apt-dmet-genotype \
    --sample-type                genomic \
    --out-dir                    output \
    --cdf-file                   $LIB_DIR/DMET_Plus.v1.cdf \
    --cel-files                  cel_files.txt \
    --run-cn-engine              true \
    --chrX-probes                $LIB_DIR/DMET_Plus.v1.chrXprobes \
    --chrY-probes                $LIB_DIR/DMET_Plus.v1.chrYprobes \
    --special-snps               $LIB_DIR/DMET_Plus.v1.specialSNPs \
    --region-model               $LIB_DIR/DMET_Plus.v1.cn-region-models.txt \
    --probeset-model             $LIB_DIR/DMET_Plus.v1.cn-probeset-models.txt \
    --batch-name                 genomic_test_20110404 \
    --batch-info                 true \
    --reference-output           refOutput.txt \
    --reference-input            $LIB_DIR/DMET_Plus.v1.genomic.ref.r2.a5 \
    --gt-analysis                quant-norm.sketch=50000,pm-only,brlmm-p-multi\
.CM=1.bins=100.mix=1.bic=2.lambda=0.0.HARD=3.SB=0.75.KX=0.3.KH=0.3.KXX=0.1\
.KAH=-0.1.KHB=-0.1.KYAH=-0.05.KYHB=-0.05.KYAB=-0.1.transform=MVA.AAM=2.8\
.BBM=-2.8.AAV=0.10.BBV=0.10.ABV=0.10.V=1.AAY=10.7.ABY=11.3.BBY=10.7\
.copyqc=0.00000.wobble=0.0.MS=0.1.copytype=-1.clustertype=2.CSepPen=0.5\
.ocean=0.0000001.cc-alleles=6.cc-type=UCHAR.cc-version=1.0 \
    --geno-call-thresh           0.1 \
    --cc-chp-output              true \
    --probeset-ids               $LIB_DIR/DMET_Plus.v1.genomic.gt.ps \
    --cn-region-gt-probeset-file $LIB_DIR/DMET_Plus.v1.cn-gt.ps \
    --chip-type                  DMET_Plus

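The call is identical to the original fixed-boundary analysis except for four settings: CM is 1 instead of 2 in the --gt-analysis string; --reference-input points to the dynamic-boundary reference file DMET_Plus.v1.genomic.ref.r2.a5; the batch options --batch-name and --batch-info are added; and --reference-output is added, naming an output reference file (refOutput.txt).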

To process plasmids, set --sample-type to plasmid, --reference-input to DMET_Plus.v1.plasmid.ref.r2.a5, and --probeset-ids to DMET_Plus.v1.plasmid.gt.ps.

The following call yields the same results as the "Dynamic Genotype Boundaries — version 2" analysis configuration for DMET Console.

  apt-dmet-genotype \
    --sample-type                genomic \
    --out-dir                    output \
    --cdf-file                   $LIB_DIR/DMET_Plus.v1.cdf \
    --cel-files                  cel_files.txt \
    --run-cn-engine              true \
    --chrX-probes                $LIB_DIR/DMET_Plus.v1.chrXprobes \
    --chrY-probes                $LIB_DIR/DMET_Plus.v1.chrYprobes \
    --special-snps               $LIB_DIR/DMET_Plus.v1.specialSNPs \
    --region-model               $LIB_DIR/DMET_Plus.v1.cn-region-models.txt \
    --probeset-model             $LIB_DIR/DMET_Plus.v1.cn-probeset-models.txt \
    --batch-name                 genomic_test_20110404 \
    --batch-info                 true \
    --reference-output           refOutput.txt \
    --reference-input            $LIB_DIR/DMET_Plus.v1.genomic.ref.r4.a5 \
    --gt-analysis                quant-norm.sketch=50000,pm-only,brlmm-p-multi\
.CM=1.bins=100.mix=1.bic=2.lambda=0.0.HARD=3.SB=0.75.KX=0.3.KH=0.3.KXX=0.1\
.KAH=-0.1.KHB=-0.1.KYAH=-0.05.KYHB=-0.05.KYAB=-0.1.transform=MVA.AAM=2.8\
.BBM=-2.8.AAV=0.10.BBV=0.10.ABV=0.10.V=1.AAY=10.7.ABY=11.3.BBY=10.7\
.copyqc=0.00000.wobble=0.0.MS=0.001.copytype=-1.clustertype=2.CSepPen=0.5\
.ocean=0.00000000001.cc-alleles=6.cc-type=UCHAR.cc-version=1.0 \
    --geno-call-thresh           0.001 \
    --cc-chp-output              true \
    --probeset-ids               $LIB_DIR/DMET_Plus.v1.genomic.gt.ps \
    --cn-region-gt-probeset-file $LIB_DIR/DMET_Plus.v1.cn-gt.ps \
    --chip-type                  DMET_Plus

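The three changes from the previous call are the reference input file (DMET_Plus.v1.genomic.ref.r4.a5 instead of DMET_Plus.v1.genomic.ref.r2.a5), the call threshold (MS in the --gt-analysis string and --geno-call-thresh are both 0.001 instead of 0.1), and the ocean parameter (0.00000000001 instead of 0.0000001).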

Accessing the genotyping and copy number results

Having run apt-dmet-genotype, you may wish to feed the resulting copy number and genotyping results into your bioinformatic pipeline directly. This connection to your pipeline may be facilitated through the generation of text-based output that simplifies the parsing of the output. Adding the following sequence to the end of the options for apt-dmet-genotype will generate text-formatted output files:

  -- --summaries=true --feat-effects -- --text-output -- --table-output --feat-effects --summaries

The output directory created by apt-dmet-genotype contains four sub-directories:

The adc directory includes the dmet.copynumber.txt file, which summarizes the copy number analysis based on results from specially designed copy-number probes as well as from genotyping markers. The apg directory includes the dmet.calls.txt, dmet.forced-calls.txt, and dmet.confidences.txt files. The dmet.calls.txt file contains the genotyping calls, and the dmet.confidences.txt file contains the associated confidences. When the confidence value exceeds a threshold (controlled by the MS parameter in the --gt-analysis algorithm string above), the associated genotype is a no-call. The dmet.forced-calls.txt contains the genotyping calls ignoring the confidence threshold, so there is a call in every case.
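
Because dmet.forced-calls.txt holds a call in every position, comparing it with dmet.calls.txt gives a simple per-sample count of the genotypes that were turned into no-calls by the confidence threshold. The following is a sketch under stated assumptions: both files are taken to be tab-delimited with the same rows and columns in the same order, lines starting with # are treated as header comments, the first remaining line is taken to be the column header, and every difference between the two files is assumed to be a no-call. The function name count_nocalls is hypothetical.

```shell
# count_nocalls CALLS_FILE FORCED_CALLS_FILE
# Prints "<sample><TAB><count>" for each sample column, where count is the
# number of markers whose entry differs between the two files.
count_nocalls() {
  awk -F '\t' '
    /^#/ { next }                                  # skip comment lines
    FILENAME == ARGV[1] { calls[++n1] = $0; next } # remember calls rows
    {
      n2++
      split(calls[n2], c, "\t")                    # matching calls row
      if (n2 == 1) {                               # column header row
        for (i = 2; i <= NF; i++) name[i] = $i
        ncol = NF
        next
      }
      for (i = 2; i <= NF; i++)
        if (c[i] != $i) diff[i]++
    }
    END {
      for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], diff[i]
    }
  ' "$1" "$2"
}

# Example use:
#   count_nocalls output/apg/dmet.calls.txt output/apg/dmet.forced-calls.txt
```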

The dmetchp.html file in the Affymetrix Fusion SDK file format description explains the various codes used in the files. Although the dmetchp.html file refers to a binary format, the same codes appear in the text output.

In the case of genotyping, the preceding mapping of genotyping codes results in abstract allele symbols (e.g. AA or AB) rather than the actual bases in the DNA sequence. One of the DMET annotation files, DMET_Plus.v1.20110329.dc_annot.csv, provides a mapping from the abstract allele symbols to the DNA bases. Below are relevant fields in this file.

Probe Set ID  Common Name           Design Strand  Alleles Design Strand  Allele Code  Switch Design Strand to Report  Alleles Reported Strand  Reported Strand
AM_10001      CHST3_c.*1130C>G      1              C // G                 A // B       0                               C // G                   +
AM_10002      CHST3_c.*1155G>CorGG  1              C // G // GG           A // B // C  0                               C // G // GG             +
AM_10003      CHST3_c.*1278C>T      1              C // T                 A // B       0                               C // T                   +
AM_10004      CHST3_c.*1314G>A      1              A // G                 A // B       0                               A // G                   +
AM_10005      CHST3_c.*1361C>T      1              C // T                 A // B       0                               C // T                   +
AM_10006      CHST3_c.*1844T>C      -1             A // G                 A // B       1                               T // C                   +

The order of the abstract allele code names in the Allele Code column is the same order as the reported allele names, represented in the Alleles Reported Strand column. For example, if the genotype number code for a sample at marker AM_10002 is 38 in dmet.calls.txt, a lookup in the appropriate table in the dmetchp.html file yields the abstract letter code genotype C|C. From the above table, this abstract genotype corresponds to GG|GG.
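
The positional lookup described above can be sketched in a few lines of shell. The two list arguments are the Allele Code and Alleles Reported Strand values for one probe set, copied from the annotation file; extracting them from the CSV is not shown. The function names report_allele and report_genotype are hypothetical, and the genotype is assumed to be written as two letters separated by "|", as in the example above.

```shell
# report_allele CODES REPORTED LETTER
# CODES and REPORTED are " // "-separated lists in matching order.
# Prints the reported allele at the position where LETTER occurs in CODES.
report_allele() {
  awk -v codes="$1" -v reported="$2" -v letter="$3" 'BEGIN {
    n = split(codes, c, " // ")
    m = split(reported, r, " // ")
    for (i = 1; i <= n && i <= m; i++)
      if (c[i] == letter) { print r[i]; exit 0 }
    exit 1                         # letter not found in the code list
  }'
}

# report_genotype CODES REPORTED GENOTYPE
# Translates both alleles of an abstract genotype such as "C|C".
report_genotype() {
  first=${3%%|*}                   # text before the first "|"
  second=${3#*|}                   # text after the first "|"
  printf '%s|%s\n' \
    "$(report_allele "$1" "$2" "$first")" \
    "$(report_allele "$1" "$2" "$second")"
}

# AM_10002 from the table: abstract genotype C|C
report_genotype "A // B // C" "C // G // GG" "C|C"
```

For AM_10002 this prints GG|GG, matching the worked example in the text.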

Affymetrix Power Tools (APT) Release 1.14.3