Affymetrix® BPMAP File Format


BPMAP FILE

Description

The BPMAP file contains information relating to the design of the Affymetrix tiling arrays.

Version 2 added the ability to store a version, group and parameters associated with each sequence item.

Version 3 added the ability to store perfect match probes in addition to probe pairs.

Format

The BPMAP file is a binary file with data stored in big-endian byte order. The sections appear in the file in the order listed below. Each section is defined in detail under Section Definitions.

File Header

Sequence Description for sequence #1
Sequence Description for sequence #2
...
Sequence Description for sequence #N

Sequence Header for sequence #1
Position Information for probe/probe pair #1 of sequence #1
Position Information for probe/probe pair #2 of sequence #1
...
Position Information for probe/probe pair #M of sequence #1

Sequence Header for sequence #2
Position Information for probe/probe pair #1 of sequence #2
Position Information for probe/probe pair #2 of sequence #2
...
Position Information for probe/probe pair #M of sequence #2

...

Here N is the number of sequences, and M stands for M_i, the number of probes or probe pairs in sequence i.

Section Definitions

File Header

Item Description Type Size
1 Magic number. A value to identify the file type. The value is set to 'PHT7\r\n\032\n' (\032 is an octal escape for the byte 0x1A). char 8 bytes
2 The version number of the file. The version number is either 1.0, 2.0 or 3.0.

Due to a bug in the BPMAP file writer for early access arrays, this value may not be stored as a big-endian float. To read this value:

When on a big-endian machine: read 4 bytes, swap the direction of the bytes, cast this to an integer, swap the bytes and cast to a float.

When on a little-endian machine: read 4 bytes, cast the value as an integer, swap the bytes and cast to a float.

float 4 bytes
3 Number of sequences stored in the file. unsigned int 4 bytes
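As an illustration of the header layout, the following Python sketch reads the three header fields with the standard struct module. The names parse_file_header and MAGIC are hypothetical and not part of any Affymetrix API. The sketch assumes the version is stored as an ordinary big-endian float; it rejects any other value instead of attempting the early access workaround described above.

```python
import struct

# Item 1: the 8-byte magic number ('\032' in the spec is the byte 0x1A).
MAGIC = b"PHT7\r\n\x1a\n"


def parse_file_header(buf):
    """Return (version, number_of_sequences) from the first 16 bytes of a file.

    Hypothetical helper: assumes a well-formed big-endian float version.
    """
    if buf[:8] != MAGIC:
        raise ValueError("not a BPMAP file: bad magic number")
    # Items 2 and 3: big-endian float followed by big-endian unsigned int.
    version, n_sequences = struct.unpack_from(">fI", buf, 8)
    if version not in (1.0, 2.0, 3.0):
        # Possibly a file hit by the writer bug; this sketch does not decode it.
        raise ValueError("unexpected version value %r" % version)
    return version, n_sequences
```

A caller would typically read the first 16 bytes of the file in binary mode and pass them to parse_file_header before reading the sequence descriptions.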

Sequence Description

Item Description Type Size
1 Length of the sequence name. unsigned int 4 bytes
2 Sequence name. char Specified by item #1.
3 Probe mapping type. (only for version 3.0 and above files)

0 indicates a (PM/MM) probe pair tiling across the sequence.
1 indicates a PM-only tiling across the sequence.

unsigned int 4 bytes
4 Sequence file offset. (only for version 3.0 and above files)

The offset (in bytes), from the beginning of the file, of the probe position information. This is intended to enable fast lookup.

unsigned int 4 bytes
5 Number of probes/probe pairs in the sequence. unsigned int 4 bytes
6 Length of the group name (only for version 2.0 and above files) unsigned int 4 bytes
7 Group name (only for version 2.0 and above files) char Specified by item #6.
8 Length of the version number (only for version 2.0 and above files) unsigned int 4 bytes
9 Version number (only for version 2.0 and above files) char Specified by item #8.
10 Number of parameters (only for version 2.0 and above files) unsigned int 4 bytes
11 Parameters name/value. The number of parameters is specified by item #10. (only for version 2.0 and above files).

Each parameter is a name/value pair of strings, where each string is stored as follows:

  • unsigned int (4 bytes) - This is the length of the string.
  • char (# characters defined by the length of the string) - These are the characters of the string.

Type and size: see the description.
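The version-dependent fields of the sequence description can be sketched in Python as follows. The helper names (read_uint, read_string, parse_sequence_description) and the dictionary keys are hypothetical, chosen only for illustration. The sketch also assumes that names and parameter strings are ASCII text, which the format description does not state.

```python
import io
import struct


def read_uint(stream):
    """Read one big-endian 4-byte unsigned int."""
    (value,) = struct.unpack(">I", stream.read(4))
    return value


def read_string(stream):
    """Read a length-prefixed string (4-byte length, then the characters)."""
    length = read_uint(stream)
    return stream.read(length).decode("ascii")  # ASCII is an assumption


def parse_sequence_description(stream, version):
    """Parse one sequence description; fields depend on the file version."""
    desc = {"name": read_string(stream)}            # items 1-2
    if version >= 3.0:
        desc["pm_only"] = read_uint(stream) == 1    # item 3: 0 = PM/MM, 1 = PM-only
        desc["offset"] = read_uint(stream)          # item 4: file offset in bytes
    desc["n_probes"] = read_uint(stream)            # item 5
    if version >= 2.0:
        desc["group"] = read_string(stream)         # items 6-7
        desc["seq_version"] = read_string(stream)   # items 8-9
        n_params = read_uint(stream)                # item 10
        # Item 11: each parameter is a name string followed by a value string.
        desc["params"] = [
            (read_string(stream), read_string(stream)) for _ in range(n_params)
        ]
    return desc
```

For a version 3.0 file, the "offset" value could be passed to the stream's seek method to jump directly to the position information of one sequence.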

    Sequence Header

    Item Description Type Size
    1 Sequence ID unsigned int 4 bytes

    Position Information

    Item Description Type Size
    1 X coordinate on array of the perfect match (PM) probe (note: array coordinates are 0 based). unsigned int 4 bytes
    2 Y coordinate on array of the PM probe unsigned int 4 bytes
    3 X coordinate on array of the mismatch (MM) probe (only if the probe mapping type indicates PM/MM tiling) unsigned int 4 bytes
    4 Y coordinate on array of the MM probe (only if the probe mapping type indicates PM/MM tiling) unsigned int 4 bytes
    5 Length of the PM probe (and MM if a pair). unsigned char 1 byte
    6 Probe sequence. The 25-base probe sequence is packed into a 7-byte character sequence.

    Each byte represents up to 4 bases (so the format can handle probes of length up to 25bp).
    The first byte contains the first 4 bases of the probe.
    The first base of the probe is encoded in the two most significant bits of the first byte.
    The fourth base of the probe is encoded in the two least significant bits of the first byte.
    The conversion from each pair of bits to a DNA base is as follows: (0,1,2,3) -> (A,C,G,T)

    char 7 bytes
    7 Match score. Note: The current BPMAP files are based on perfect match so the scores are 1.

    See the bug description in the version number field above.

    float 4 bytes
    8 Position of the PM probe within the sequence. Note: The position is the 0-based position of the lower coordinate of the 25-mer aligned to the target. unsigned int 4 bytes
    9 1 if the matching target (not the probe) is on the forward strand, 0 if on the reverse. unsigned char 1 byte
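The position record and the 2-bit base packing described above can be sketched in Python as follows. The names decode_probe and parse_position and the dictionary keys are hypothetical. The sketch assumes the match score is stored as an ordinary big-endian float; files affected by the writer bug mentioned for the version number field would need that workaround instead.

```python
import struct

BASES = "ACGT"  # 2-bit codes (0,1,2,3) map to (A,C,G,T)

# Big-endian layouts with no padding: 33 bytes for PM/MM, 25 bytes for PM-only.
PAIR_RECORD = struct.Struct(">IIIIB7sfIB")
PM_ONLY_RECORD = struct.Struct(">IIB7sfIB")


def decode_probe(packed, length):
    """Unpack `length` bases from the 7-byte field, most significant bits first."""
    bases = []
    for i in range(length):
        byte = packed[i // 4]        # four bases per byte
        shift = 6 - 2 * (i % 4)      # first base sits in the top two bits
        bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def parse_position(buf, offset=0, pm_only=False):
    """Parse one position record starting at `offset` in `buf`."""
    if pm_only:
        pm_x, pm_y, length, packed, score, pos, strand = (
            PM_ONLY_RECORD.unpack_from(buf, offset))
        mm = None                    # items 3-4 are absent for PM-only tiling
    else:
        pm_x, pm_y, mm_x, mm_y, length, packed, score, pos, strand = (
            PAIR_RECORD.unpack_from(buf, offset))
        mm = (mm_x, mm_y)
    return {
        "pm": (pm_x, pm_y),
        "mm": mm,
        "probe": decode_probe(packed, length),
        "score": score,              # assumed to be a plain big-endian float
        "position": pos,
        "forward": strand == 1,
    }
```

Because both record layouts have a fixed size, the records of a sequence can be read in a loop by advancing the offset by PAIR_RECORD.size or PM_ONLY_RECORD.size after each call.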