Affymetrix® CDF Data File Format
CDF FILE
Description
The CDF file describes the layout for an Affymetrix GeneChip array. An array may contain Expression, Genotyping, CustomSeq, Copy Number and/or Tag probe sets. All probe set names within an array are unique. Multiple copies of a probe set may exist on a single array as long as each copy has a unique name.
The information below will describe the following versions:
- ASCII text format is used by the MAS and GCOS 1.0 software. This was also known as the ASCII version.
- XDA format is used by the GCOS 1.2 and above software. This was also known as the binary or XDA version.
The ASCII format of the CDF file is a text file similar to the Windows INI format.
The file is divided into sections. Each section starts with a line containing the section name enclosed in square brackets. The section names are: "CDF", "Chip", "QCI" (where I ranges from 1 to the number of QC probe sets), "Unit J" (where J is an internal index to uniquely distinguish probe sets), and "Unit J_Block K" (where J and K are internal indices used to distinguish subsets of a probe set). The data lines in each section have the format TAG=VALUE.
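The section layout described above can be read with a few lines of Python. This is a minimal sketch, not the Affymetrix parser: the function name `parse_cdf_sections` and the sample text are made up for illustration, and a repeated tag keeps only its last value here.

```python
def parse_cdf_sections(text):
    """Return {section_name: {tag: value}} for INI-like CDF text."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A bracketed line opens a new section.
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            # Split on the first '=' only; values may contain tabs.
            tag, value = line.split("=", 1)
            current[tag.strip()] = value
    return sections


# Hypothetical sample content, not taken from a real array.
sample = "[CDF]\nVersion=GC3.0\n\n[Chip]\nName=Example\nRows=2\nCols=3\n"
sections = parse_cdf_sections(sample)
print(sections["CDF"]["Version"])
```

All values are kept as strings; converting tags such as Rows or Cols to integers is left to the caller.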
The "CDF" section contains the version number of the file. The TAGS are:
TAG | Description |
---|---|
Version | The version number. Should always be set to "GC1.0", "GC2.0", "GC3.0", "GC4.0", "GC5.0", "GC6.0", or "GC7.0". This document describes GC3.0, GC4.0, GC5.0, and GC6.0 version CDF files. |
GUID | The unique identifier of the CDF. (Only available in version 6 or 7) |
md5 | The integrity md5 of the CDF. (Only available in version 6 or 7) |
The "Chip" section contains the following TAGS:
TAG | Description |
---|---|
Name | The name of the array. This item is not used by the software. |
ChipType | The probe array type. Multiple entries may exist. (Only available in version 6 or 7) |
Rows | The number of rows of cells on the array. |
Cols | The number of columns of cells on the array. |
NumberOfUnits | The number of units in the array not including QC units. For CustomSeq arrays, there are 2 units: Unit1 contains the probes interrogating a sense target and Unit2 contains the probes interrogating an anti-sense target. For all other array types, there exists one unit per probe set. |
MaxUnit | Each unit is given a unique number. This value is the maximum of the unit numbers of all the units in the array (not including QC units). |
NumQCUnits | The number of QC units. QC units are defined in version 2 and above. CustomSeq arrays do not contain any QC units. |
ChipReference | Used for CustomSeq, HIV and P53 arrays only. This is the reference sequence displayed by the Affymetrix software. The sequence may contain spaces. This value is defined for version 2 and above. |
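Because ChipType may appear more than once, a reader of the "Chip" section has to collect repeated tags rather than overwrite them. The following sketch shows one way to do that; the function name `read_chip_section` and the sample values (including the chip type names) are hypothetical.

```python
# Tags in the "Chip" section that hold counts.
INTEGER_TAGS = ("Rows", "Cols", "NumberOfUnits", "MaxUnit", "NumQCUnits")


def read_chip_section(lines):
    """Interpret the TAG=VALUE lines of a "Chip" section."""
    chip = {"ChipType": []}
    for line in lines:
        tag, _, value = line.partition("=")
        if tag == "ChipType":
            chip["ChipType"].append(value)  # multiple entries may exist
        elif tag in INTEGER_TAGS:
            chip[tag] = int(value)
        else:
            chip[tag] = value  # Name, ChipReference
    return chip


# Hypothetical sample lines.
chip = read_chip_section([
    "Name=Example",
    "ChipType=ExampleArray",
    "ChipType=ExampleArray.v2",
    "Rows=2",
    "Cols=3",
    "NumberOfUnits=1",
    "MaxUnit=1",
    "NumQCUnits=0",
])
print(chip["ChipType"], chip["Rows"] * chip["Cols"])
```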
The next set of sections where the name begins with "QC" define the QC units or probe sets in the array. There are NumQCUnits (from the Chip section) QC sections.
Each section name is a combination of "QC" and an index ranging from 1 to NumQCUnits; the sections are listed sequentially. QC units are defined for version 2 and above.
Each section contains the following TAGS:
TAG | Description |
---|---|
Type | Defines the type of QC probe set. The defined types are: 0 - Unknown |
NumberCells | The number of cells in the probe set. |
CellHeader | Defines the data contained in the subsequent lines, separated by tabs. For all QC probe set types: The final data items are dependent on the type of the QC probe set: |
Celli | This contains the information about a cell that belongs to the probe set. The value of i in the tag ranges from 1 to the number of cells in the probe set and will be listed sequentially. The values in each line depend on the CellHeader. The values are separated by tabs. |
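Since the columns of each Celli line are named by CellHeader, the two can be zipped together. The sketch below assumes the section has already been read into a dictionary of strings; the function name `parse_cells` and the column names in the sample are placeholders, not the column list of any particular QC type.

```python
def parse_cells(section):
    """Pair each Celli value with the column names given by CellHeader."""
    columns = section["CellHeader"].split("\t")
    count = int(section["NumberCells"])
    cells = []
    for i in range(1, count + 1):  # Cell1 .. CellN, listed sequentially
        values = section[f"Cell{i}"].split("\t")
        cells.append(dict(zip(columns, values)))
    return cells


# Hypothetical QC section; the column names are placeholders.
qc_section = {
    "Type": "0",
    "NumberCells": "2",
    "CellHeader": "X\tY\tEXAMPLE",
    "Cell1": "0\t0\ta",
    "Cell2": "1\t0\tb",
}
cells = parse_cells(qc_section)
print(cells[1]["X"])
```

Driving the parse from CellHeader, rather than from fixed column positions, keeps the reader independent of the QC type.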
The next set of sections where the name begins with "Unit" define the probes that are a member of the unit (probe set). Each unit is divided into subsections termed "Blocks" which are referred to as "groups" in the Files SDK documentation.
Each section name is a combination of "Unit" and an index. There is no meaning to the index value. Immediately following the "Unit" section there will be the "Block" sections for that unit before the next unit is defined.
Each "Unit" section contains the following TAGS:
TAG | Description |
---|---|
Name | The name of the unit. The probe set name for Genotyping, Copy Number, Polymorphic Marker and Multichannel Marker units or "NONE" for all other unit types. |
Direction | Defines if the probes are interrogating a sense target or anti-sense target (1 - sense, 2 - anti-sense, 3 - both). |
NumAtoms | The number of atoms in the entire probe set. This TAG contains up to two values after the equal sign. The first is the number of atoms and the second (if present) is the number of cells in each atom. An atom is a probe quartet for CustomSeq units and a probe pair for all other unit types. |
NumCells | The number of cells in the entire probe set. Probe pairs contain 2 cells and probe quartets contain 4 cells. |
UnitNumber | An arbitrary index value for the probe set. |
UnitType | Defines the type of unit (0 - Unknown, 1 - CustomSeq, 2 - Genotyping, 3 - Expression, 7 - Tag/GenFlex, 8 - Copy Number, 9 - Genotyping Control, 10 - Expression Control, 11 - Polymorphic Marker, 12 - Multichannel Marker). An array may contain units of varying types. |
NumberBlocks | The number of blocks or groups in the probe set. |
MutationType | Used for Genotyping units only in defining the type of polymorphism (0 - substitution, 1 - insertion, 2 - deletion). This value is available in version 2 and above. |
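The numeric codes in a "Unit" section can be mapped to the names listed in the table. This is a sketch under assumptions: the function name `interpret_unit` is made up, and because the separator between the two NumAtoms values is not stated here, the code simply extracts every run of digits.

```python
import re

# Code tables copied from the "Unit" section description (ASCII format).
UNIT_TYPES = {
    0: "Unknown", 1: "CustomSeq", 2: "Genotyping", 3: "Expression",
    7: "Tag/GenFlex", 8: "Copy Number", 9: "Genotyping Control",
    10: "Expression Control", 11: "Polymorphic Marker",
    12: "Multichannel Marker",
}
DIRECTIONS = {1: "sense", 2: "anti-sense", 3: "both"}


def interpret_unit(tags):
    """Convert the string tags of a "Unit" section to typed values."""
    # Assumption: separator-agnostic extraction of one or two integers.
    numbers = [int(n) for n in re.findall(r"\d+", tags["NumAtoms"])]
    return {
        "name": tags["Name"],
        "unit_type": UNIT_TYPES.get(int(tags["UnitType"]), "Unrecognized"),
        "direction": DIRECTIONS.get(int(tags["Direction"]), "Unrecognized"),
        "num_atoms": numbers[0],
        "cells_per_atom": numbers[1] if len(numbers) > 1 else None,
        "num_cells": int(tags["NumCells"]),
        "num_blocks": int(tags["NumberBlocks"]),
    }


# Hypothetical Expression unit with 11 probe pairs.
unit = interpret_unit({
    "Name": "NONE", "Direction": "2", "NumAtoms": "11", "NumCells": "22",
    "UnitNumber": "1", "UnitType": "3", "NumberBlocks": "1",
})
print(unit["unit_type"], unit["direction"])
```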
The "Unit_Block" sections follow the "Unit" section. There are as many "Unit_Block" sections as defined by NumberBlocks. Each block lists its member probes.
The TAGS are:
TAG | Description |
---|---|
Name | The name of the block. For Genotyping units this is the allele. For Polymorphic Marker and Multichannel Marker units this is "None". For all other unit types this is the name of the probe set. |
BlockNumber | An index to the block. |
Wobble | The wobble situation for Polymorphic Marker and Multichannel Marker units in the block. Only available in version 4, 5, 6, and 7. |
Allele | The allele code for Polymorphic Marker and Multichannel Marker units in the block. Only available in version 4, 5, 6, and 7. |
Channel | The channel code for multichannel microarray platform. Only available in version 5, 6, and 7. |
RepType | The probe replication type (0 - unknown, 1 - different probe sequences, 2 - some probe sequences are identical, 3 - all probe sequences are identical) for probe set groups used under multichannel microarray platform. Only available in version 5, 6, and 7. |
NumAtoms | The number of atoms in the block. |
NumCells | The number of cells in the block. |
StartPosition | The position of the first atom. |
StopPosition | The position of the last atom. |
Direction | Used for Genotyping, Polymorphic Marker and Multichannel Marker units only in defining whether the probes are interrogating a sense target or anti-sense target (0 - no direction, 1 - sense, 2 - anti-sense). This value is available in version 3 and above. |
CellHeader | Defines the data contained in the subsequent lines, separated by tabs. The values are: X - The X coordinate of the cell. The following are only available in version 2 and above: |
Celli | This contains the information about a cell that belongs to the block. The value of i in the tag ranges from 1 to the number of cells in the block. The values in each line depend on the CellHeader. The values are separated by tabs. |
The XDA format of the CDF file is a binary file created for faster access and smaller file size. The values in the file are stored in little-endian format.
The file contents are defined by:
Item | Description | Type |
---|---|---|
1 | Magic number. Always set to 67. | integer |
2 | Version number. Should be set to 1, 2, 3, 4, or 5. | integer |
3 | The length of the GUID, a unique identifier of the CDF. (Only available in version 4) | unsigned integer |
4 | GUID, the unique identifier of the CDF. (Only available in version 4) | char[length defined above] |
5 | The integrity md5 of the CDF. (Only available in version 4) | char[32] |
6 | The number of probe array types. (Only available in version 4) | unsigned char |
7 | The length of probe array type. (Only available in version 4) | unsigned integer |
8 | The probe array type. (Only available in version 4) | char[length defined above] |
9 | The length and value of probe array type as described in Item 7 and 8 respectively if there is more than one entry. (Only available in version 4) | (unsigned integer + char[length defined]) * (# of probe array types - 1) |
10 | The number of columns of cells on the array. | unsigned short |
11 | The number of rows of cells on the array. | unsigned short |
12 | The number of units in the array not including QC units. The term unit is an internal term which means probe set. | integer |
13 | The number of QC units. | integer |
14 | The length of the CustomSeq reference sequence. | integer |
15 | The CustomSeq reference sequence. | char[length defined above] |
16 | The probe set name. The UNIT name for CustomSeq, Genotyping, Polymorphic Marker, and Multichannel Marker. The BLOCK name for Expression. | char[64] * (# of units) |
17 | File position for the start of each QC unit information block. | integer * (# of QC units) |
18 | File position for the start of each unit information block. | integer * (# of units) |
19 | QC information, repeated for each QC unit: Type - unsigned short. Probe information, repeated for each probe in the QC unit: X coordinate - unsigned short; Y coordinate - unsigned short; Probe length - unsigned char; Perfect match flag - unsigned char; Background probe flag - unsigned char. | see description |
20 | Unit information, repeated for each unit: UnitType - unsigned short (1 - Expression, 2 - Genotyping, 3 - CustomSeq, 4 - Tag, 5 - Copy Number, 6 - Genotyping Control, 7 - Expression Control, 8 - Polymorphic Marker, 9 - Multichannel Marker). Block information, repeated for each block in the unit: Number of atoms - integer. Cell information, repeated for each cell in the block: Atom number - integer. | see description |
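Items 1 through 18 of the table can be read with Python's `struct` module using little-endian format codes. The sketch below rests on several assumptions that the table does not state: "integer" is taken to be 32 bits and "short" 16 bits, probe set names are taken to be NUL-padded within their 64 bytes, and the GUID, md5, and probe array type fields are read for version 4 and later. The function name `read_xda_header` and the sample bytes are made up for illustration.

```python
import io
import struct


def read_xda_header(stream):
    """Read items 1-18 of an XDA CDF file from a binary stream."""
    def read(fmt):
        return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))

    magic, version = read("<ii")  # '<' selects little-endian, no padding
    if magic != 67:
        raise ValueError("magic number is not 67")
    header = {"version": version}
    if version >= 4:  # assumption: fields persist after version 4
        (guid_len,) = read("<I")
        header["guid"] = stream.read(guid_len).decode("ascii")
        header["md5"] = stream.read(32).decode("ascii")
        (type_count,) = read("<B")
        chip_types = []
        for _ in range(type_count):
            (length,) = read("<I")
            chip_types.append(stream.read(length).decode("ascii"))
        header["chip_types"] = chip_types
    header["cols"], header["rows"] = read("<HH")
    num_units, num_qc, ref_len = read("<iii")
    header["reference"] = stream.read(ref_len).decode("ascii")
    # Assumption: names are NUL-padded to 64 bytes.
    header["names"] = [
        stream.read(64).split(b"\0", 1)[0].decode("ascii")
        for _ in range(num_units)
    ]
    header["qc_positions"] = list(read(f"<{num_qc}i"))
    header["unit_positions"] = list(read(f"<{num_units}i"))
    return header


# Hypothetical version 1 header: 3 columns, 2 rows, one unit, no QC units.
data = (
    struct.pack("<iiHHiii", 67, 1, 3, 2, 1, 0, 0)
    + b"probe_a".ljust(64, b"\0")
    + struct.pack("<i", 999)
)
header = read_xda_header(io.BytesIO(data))
print(header["cols"], header["rows"], header["names"])
```

The file positions returned for QC units and units can then be passed to `stream.seek` to reach the information blocks of items 19 and 20.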