Linking Files Through GUIDs


The new Affymetrix GeneChip Command Console (AGCC) binary files and sample attribute (*.ARR) files provide for storage of GUIDs that help identify a file and its relationship to other files. Examples of such relationships are parent file associations, such as the CEL file used to create a CHP file, and the group of files associated with a batch analysis.

This whitepaper provides an overview of the header format of the AGCC binary files and presents a few use cases showing how to take advantage of the GUIDs stored in the file header.

Additional information regarding the file format, along with C++ and Java parsers, is contained within the Fusion SDK.

The AGCC software stores the sample attributes entered by the user in an XML file with an "ARR" extension. This file can store zero or more name/value/type attributes and references to zero or more physical arrays. For example, a user may define a single ARR file to store the NSP and STY arrays for a 500K array set. Each reference to a physical array is a GUID that uniquely identifies the array.

The header of an AGCC-formatted binary file stores the following information. Note that this is not a complete list of the contents of the file header.

  • A GUID to uniquely identify the file. Since the GUID is embedded within the contents of the file, renaming the file has no effect on your ability to identify the relationship of this file to another file or array.
  • An identifier for the file type
  • An array of name/value/type parameters (algorithm parameters, program name, company name, summary statistics are examples of items stored)
  • The header (the above items) of the parent file. This is stored only for parents that are also in the AGCC format. The parent header also contains its own parent header, thus providing the complete lineage of how the file was created. As an example, a CHP file contains its parent CEL file header, which in turn contains its parent DAT file header, which in turn contains a reference to the physical array (the physical array GUID stored in the ARR file).
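The nested parent headers described above can be pictured as a linked chain. The following Python sketch is only an illustration of that idea: the `FileHeader` class, its field names, and the GUID strings are assumptions made for this example and are not the Fusion SDK types or the on-disk layout.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class FileHeader:
    """Hypothetical in-memory model of an AGCC file header (not an SDK type)."""
    guid: str                 # GUID that identifies this file (or physical array)
    file_type: str            # identifier for the file type
    # name/value/type parameters, e.g. algorithm parameters or program name
    params: List[Tuple[str, str, str]] = field(default_factory=list)
    # embedded copy of the parent's header, if the parent is in the AGCC format
    parent: Optional["FileHeader"] = None


def lineage(header: FileHeader) -> Iterator[FileHeader]:
    """Yield the header itself, then its parent, grandparent, and so on."""
    current: Optional[FileHeader] = header
    while current is not None:
        yield current
        current = current.parent


# Placeholder GUIDs and type labels, for illustration only.
arr = FileHeader("guid-array", "ARR")
dat = FileHeader("guid-dat", "DAT", parent=arr)
cel = FileHeader("guid-cel", "CEL", parent=dat)
chp = FileHeader("guid-chp", "CHP", parent=cel)

chain = [h.file_type for h in lineage(chp)]
print(" -> ".join(chain))  # prints "CHP -> CEL -> DAT -> ARR"
```

Because each header carries a full copy of its parent header, the lineage can be read from the CHP file alone, without opening the CEL or DAT files.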

#1: Linking analysis results stored in a CHP file to the parent CEL file and the associated sample attributes stored in the ARR file.

The header of the CHP file provides a copy of the header of the CEL file. This header includes the GUID that identifies the CEL file. Since we are linking the files by GUID, we are not dependent on the file name. The files may be named differently, or they may be renamed after creation. Neither scenario affects the ability to link the files.

The header of the CHP file also provides a copy of the GUID of the physical array. This is obtained by traversing the parent header sections until the ARR header is found. The GUID of the physical array is stored in the ARR header section.

So given the CHP file header, we can extract the GUID of the CEL file used to create the CHP file and the GUID of the physical array. By extracting the GUIDs stored in the CEL files, we can make a link between the CHP file and its parent CEL file. Likewise, by extracting the GUIDs stored in the ARR files, we can make a link between the CHP file and the ARR file.
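As a rough illustration of this use case, the Python sketch below walks the parent headers of a CHP header and looks up the matching CEL and ARR files by GUID. It assumes the headers have already been parsed into plain dictionaries; the dictionary keys, the helper names `find_ancestor` and `link_chp`, the GUID strings, and the file names are all hypothetical and are not part of the Fusion SDK.

```python
from typing import Dict, Optional, Tuple

# Hypothetical header model: a dict with "guid", "type", and "parent" keys.
Header = Dict[str, object]


def find_ancestor(header: Header, file_type: str) -> Optional[Header]:
    """Walk the embedded parent headers until one of the given type is found."""
    current = header.get("parent")
    while current is not None:
        if current["type"] == file_type:
            return current
        current = current.get("parent")
    return None


def link_chp(chp_header: Header,
             cel_files_by_guid: Dict[str, str],
             arr_files_by_array_guid: Dict[str, str]
             ) -> Tuple[Optional[str], Optional[str]]:
    """Return the (CEL file, ARR file) linked to a CHP header, or None where absent."""
    cel = find_ancestor(chp_header, "CEL")
    arr = find_ancestor(chp_header, "ARR")
    cel_file = cel_files_by_guid.get(cel["guid"]) if cel else None
    arr_file = arr_files_by_array_guid.get(arr["guid"]) if arr else None
    return cel_file, arr_file


# Placeholder data: GUIDs and file names are invented for this example.
chp_header = {
    "guid": "guid-chp", "type": "CHP",
    "parent": {
        "guid": "guid-cel", "type": "CEL",
        "parent": {
            "guid": "guid-dat", "type": "DAT",
            "parent": {"guid": "guid-array", "type": "ARR", "parent": None},
        },
    },
}
# GUIDs previously extracted from the CEL files and ARR files on disk.
cel_files = {"guid-cel": "renamed_scan.CEL", "guid-other": "unrelated.CEL"}
arr_files = {"guid-array": "sample_attributes.ARR"}

linked = link_chp(chp_header, cel_files, arr_files)
print(linked)  # prints ('renamed_scan.CEL', 'sample_attributes.ARR')
```

The lookup tables are keyed by GUID rather than by file name, so a renamed CEL file is still matched.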

#2: Identifying files analyzed in a batch.

The same GUID technique can be used to group CHP files that were analyzed together in a batch. This time we search the parameter section of the CHP file. Affymetrix software stores a GUID associated with a batch run as one of the name/value/type parameters. The name of this CHP file parameter contains "exec-guid". All of the CHP files created during the batch analysis have the same "execution" GUID.

So given the list of CHP files, one can determine if they were analyzed as part of the same batch.
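A minimal Python sketch of this grouping follows. It assumes the name/value/type parameters of each CHP file have already been read into lists of tuples; the helper names, the full parameter name "example-exec-guid", the GUID values, and the file names are placeholders invented for this example. Only the "exec-guid" substring comes from the description above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Param = Tuple[str, str, str]  # (name, value, type)


def execution_guid(params: List[Param]) -> Optional[str]:
    """Return the value of the first parameter whose name contains 'exec-guid'."""
    for name, value, _type in params:
        if "exec-guid" in name:
            return value
    return None


def group_by_batch(params_by_file: Dict[str, List[Param]]) -> Dict[str, List[str]]:
    """Group CHP file names by their execution GUID."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for file_name, params in params_by_file.items():
        guid = execution_guid(params)
        if guid is not None:  # skip files without an execution GUID
            batches[guid].append(file_name)
    return dict(batches)


# Placeholder parameters: names, values, and file names are invented.
chp_files = {
    "a.CHP": [("program-name", "ExampleApp", "text"),
              ("example-exec-guid", "batch-1", "text")],
    "b.CHP": [("example-exec-guid", "batch-1", "text")],
    "c.CHP": [("example-exec-guid", "batch-2", "text")],
}

batches = group_by_batch(chp_files)
for guid, files in batches.items():
    print(guid, files)
```

Two CHP files belong to the same batch exactly when they appear under the same key in the result.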