We provide annotation data for our GeneChip® catalog arrays in multiple formats. This document describes how we have encoded our array annotations in the MAGE-ML format in files available on the support pages for each array.
Understanding the MAGE-ML Probe Set Files
This document describes how we have encoded annotations for probe sets in our GeneChip catalog arrays in downloadable XML files using the MAGE-ML format. Before reading this document, you may wish to familiarize yourself with more general information about the MAGE-ML format.
Note that to use the XML files, you generally need a copy of the MAGE-ML.dtd file, available at the MAGE-ML support page of the MGED Society (Microarray Gene Expression Data Society). Our XML files assume this DTD file will be located in the same directory as the XML files.
The MAGE-ML format allows a great deal of flexibility in how some types of data can be encoded. To maximize compatibility with files generated by other groups, we follow the recommendations of the European Bioinformatics Institute for the following elements, among others:
- <BioSequence> identifiers
- <Database_ref> identifiers
- <OntologyEntry> categories and values
- Species names
For annotations that are unique to Affymetrix, or for which standards have not yet been developed, we use only the <NameValueType> XML element, with each name attribute beginning with "Affy:".
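As a minimal sketch of how a reader of these files might pick out the Affymetrix-specific annotations, the following Python fragment keeps only the <NameValueType> elements whose name begins with "Affy:". It assumes Python's standard xml.etree.ElementTree module; the function name affy_properties and the "Other:Note" entry are hypothetical and exist only to show the filter at work.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment. The two "Affy:" entries reuse names and values shown
# elsewhere in this document; "Other:Note" is a hypothetical non-Affymetrix entry.
SAMPLE = """
<PropertySets_assnlist>
  <NameValueType name="Affy:Sequence_Type" value="Exemplar"/>
  <NameValueType name="Affy:Chromosomal_Location" value="1q21"/>
  <NameValueType name="Other:Note" value="not an Affymetrix annotation"/>
</PropertySets_assnlist>
"""

def affy_properties(assnlist):
    """Return {name: value} for direct NameValueType children named 'Affy:...'."""
    return {
        nvt.get("name"): nvt.get("value")
        for nvt in assnlist.findall("NameValueType")
        if (nvt.get("name") or "").startswith("Affy:")
    }

props = affy_properties(ET.fromstring(SAMPLE))
print(props)
```

A plain dictionary is enough here only if each name occurs once per list; if a name can repeat, a list of pairs would be the safer container.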
Each MAGE-ML annotation file follows this general outline:
- XML Declarations
- The <MAGE-ML> element containing:
  - A <PropertySets_assnlist> element
  - An <AuditTrail_assnlist> element
  - A <Description_package> element
  - A <BioSequence_package> element containing:
    - A <BioSequence_assnlist> containing one <BioSequence> element for each Probe Set
The first two lines of each file will always look like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
These lines simply identify the content as a MAGE-ML file. They indicate that you will generally need a file called MAGE-ML.dtd which should be placed in the same directory as the MAGE-ML files. See general information about the MAGE-ML format for information on how to download the MAGE-ML.dtd file.
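The following sketch shows one way to read these two lines from Python, assuming the standard xml.etree.ElementTree module. That parser neither validates against nor fetches the external DTD, so the file parses even when MAGE-ML.dtd is missing; validating tools generally do need it. The helper dtd_is_alongside is a hypothetical convenience check, and the <MAGE-ML> line in the sample is a shortened stand-in for a real file body.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# The first two lines are the ones every file starts with; the third is a
# shortened, illustrative root element.
HEADER = b'''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
<MAGE-ML identifier="Affy:Transcript:HG-U133A"/>
'''

def dtd_is_alongside(xml_path):
    """Return True if MAGE-ML.dtd sits in the same directory as xml_path."""
    return (Path(xml_path).parent / "MAGE-ML.dtd").is_file()

# ElementTree reads the DOCTYPE line but does not load the DTD it names.
root = ET.fromstring(HEADER)
print(root.tag, root.get("identifier"))
```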
The <MAGE-ML> element
The next line will be the <MAGE-ML> element. Its identifier attribute identifies which Affymetrix® GeneChip probe array is being annotated. The name of the chip is always prefixed with "Affy:Transcript:". There is always one MAGE-ML file for each probe array, rather than one file for each Chip Set.
The <PropertySets_assnlist> element
The <PropertySets_assnlist> is an optional element at this location in MAGE-ML. We use it to hold <NameValueType> elements that describe the probe array in more detail.
value="Affymetrix Human Genome HG-U133 Chip Set"/>
value="Affymetrix Human Genome U133A Array"/>
The <AuditTrail_assnlist> element
The <AuditTrail_assnlist> contains an <Audit> element. The date attribute of the <Audit> element describes the date when this set of annotations was generated, in year-month-day format. This is not necessarily the same as the date when the MAGE-ML file itself was generated. Annotations are updated on a quarterly basis. The action attribute is invariably "modification".
<Audit date="2002-10-18" action="modification" />
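Because the date attribute uses year-month-day order, it can be read directly as an ISO date. This is a small sketch assuming Python 3.7 or later (for date.fromisoformat) and the standard xml.etree.ElementTree module; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The <Audit> line shown above, parsed on its own.
audit = ET.fromstring('<Audit date="2002-10-18" action="modification" />')

# Year-month-day strings are valid ISO 8601 calendar dates.
annotation_date = date.fromisoformat(audit.get("date"))
print(annotation_date.year, annotation_date.month, annotation_date.day)
```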
The <Description_package> element
The <Description_package> lists the unique identifiers of all the databases that may be referenced later in the file. (The list is unordered and may include more databases than are actually referenced later in the file.) The following example is much shorter than the list included in any of our annotation files.
The <BioSequence_package> element
The <BioSequence_package> element contains the bulk of the annotation data in each MAGE-ML file. It consists of a single <BioSequence_assnlist> element, which contains a series of <BioSequence> elements. There is one <BioSequence> for each probe set on the probe array, and these may occur in any order.
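Since the probe sets may occur in any order, a consumer will usually want to index them by identifier rather than rely on position. The sketch below assumes Python's standard xml.etree.ElementTree module and the element spelling used in the file outline earlier in this document (BioSequence_package, BioSequence_assnlist). The sample is a heavily shortened stand-in for a real file, the pairing of the title with 1053_at is assumed for illustration, and index_probe_sets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative document: real files contain many more elements.
SAMPLE = """
<MAGE-ML identifier="Affy:Transcript:HG-U133A">
  <BioSequence_package>
    <BioSequence_assnlist>
      <BioSequence identifier="Affy:Transcript:HG-U133A:200048_s_at"/>
      <BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"
                   name="replication factor C (activator 1) 2, 40kDa"/>
    </BioSequence_assnlist>
  </BioSequence_package>
</MAGE-ML>
"""

def index_probe_sets(root):
    """Map each BioSequence identifier to its element, ignoring file order."""
    path = "BioSequence_package/BioSequence_assnlist/BioSequence"
    return {bs.get("identifier"): bs for bs in root.findall(path)}

probe_sets = index_probe_sets(ET.fromstring(SAMPLE))
print(len(probe_sets), "probe sets indexed")
```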
The <BioSequence> element
Each <BioSequence> tag can carry up to three attributes: identifier, name, and sequence.
- The identifier attribute is required. It will take the form identifier="Affy:Transcript:HG-U133A:1053_at". In this example, "Affy:Transcript:HG-U133A" identifies the array, and will always match the identifier attribute of the enclosing <MAGE-ML> element, and "1053_at" is the Probe Set ID.
- The name attribute is optional. When present, it contains the Title of the Probe Set. Not every Probe Set has a Title.
- The sequence attribute is optional. When present, it contains the sequence from which individual probe sequences were chosen. (The sequence can be an "Exemplar," "Consensus," or "Control" sequence, described in a <NameValueType> below.) At present we include sequence data with each probe set, but because this information can be transmitted more compactly in other file formats, we may remove it in the future.
name="replication factor C (activator 1) 2, 40kDa"
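The identifier layout described above can be taken apart with ordinary string handling. The sketch below splits on the last colon, which assumes that Probe Set IDs themselves never contain a colon; that assumption, the ProbeSetRef type, and the function name are all illustrative rather than part of the file format.

```python
from typing import NamedTuple

PREFIX = "Affy:Transcript:"

class ProbeSetRef(NamedTuple):
    array_identifier: str  # matches the identifier of the enclosing <MAGE-ML>
    array_name: str        # array_identifier without the "Affy:Transcript:" prefix
    probe_set_id: str

def split_biosequence_identifier(identifier):
    """Split 'Affy:Transcript:<array>:<probe set>' into its parts.

    Assumes the Probe Set ID contains no colon, so the last colon is the divider.
    """
    array_identifier, sep, probe_set_id = identifier.rpartition(":")
    if not sep or not probe_set_id or not array_identifier.startswith(PREFIX):
        raise ValueError(f"unexpected BioSequence identifier: {identifier!r}")
    return ProbeSetRef(array_identifier, array_identifier[len(PREFIX):], probe_set_id)

ref = split_biosequence_identifier("Affy:Transcript:HG-U133A:1053_at")
print(ref.array_name, ref.probe_set_id)
```

An array-level identifier such as "Affy:Transcript:HG-U133A" has no Probe Set part, so the function rejects it instead of guessing.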
Each <BioSequence> element contains the following elements:
- A <PropertySets_assnlist>
- An optional <SequenceDatabases_assnlist>
- An optional <OntologyEntries_assnlist>
- A <PolymerType_assn>
- A <Type_assn>
- An optional <Species_assn>
The <PropertySets_assnlist> element
The <PropertySets_assnlist> contains mostly Affymetrix-specific annotations. This is optional from the MAGE-ML standpoint, but is always present in our <BioSequence> elements.
The following example (from "Affy:Transcript:HG-U133A:200048_s_at") illustrates the possible types of annotations that may be included in this list. Not every type of annotation is present for every Probe Set.
<NameValueType name="Affy:Sequence_ID" value="g5729888"/>
<NameValueType name="Affy:Transcript_ID" value="Hs.6396.0"/>
<NameValueType name="Affy:Sequence_Type" value="Exemplar"/>
value="1500074354 (3 transcripts, 2 gene clusters)"/>
<NameValueType name="Affy:Chromosomal_Location" value="1q21"/>
The <SequenceDatabases_assnlist> element
The <SequenceDatabases_assnlist> contains database references in <DatabaseEntry> elements. Although optional, the <SequenceDatabases_assnlist> element is rarely absent.
The <DatabaseEntry> element
Each <DatabaseEntry> element contains a single reference to an entry in a database. In its simplest form, it contains a database identifier, an accession number, and an optional URI (or URL).
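The three pieces of a simple <DatabaseEntry> can be collected into a small record. This sketch rests on assumptions that this document does not spell out: that the accession and URI are carried in attributes named accession and URI, and that the database identifier sits in a <Database_ref> nested inside a <Database_assnref>. Check those names against the MAGE-ML.dtd before relying on them. The database identifier "DB:refseq" and the example.org URL are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatabaseReference:
    database: str
    accession: str
    uri: Optional[str] = None  # the URI is optional

# ASSUMED layout and attribute names; "DB:refseq" and the URL are placeholders.
SAMPLE = """
<DatabaseEntry accession="NM_004539" URI="https://example.org/NM_004539">
  <Database_assnref>
    <Database_ref identifier="DB:refseq"/>
  </Database_assnref>
</DatabaseEntry>
"""

def read_database_entry(entry):
    """Extract database identifier, accession, and optional URI from one entry."""
    db_ref = entry.find("Database_assnref/Database_ref")
    return DatabaseReference(
        database=db_ref.get("identifier", "") if db_ref is not None else "",
        accession=entry.get("accession", ""),
        uri=entry.get("URI"),
    )

reference = read_database_entry(ET.fromstring(SAMPLE))
print(reference)
```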
We often need to provide additional information in <DatabaseEntry> elements, which we do through a <PropertySets_assnlist>. The most common addition is a Description string, but we also often indicate which type of annotation is being described by providing a value for Affy:Annotation_Category. For example, a probe set may contain multiple references to the RefSeq database that mean very different things:
value="Full-Length Reference Sequence"/>
value="gb:NM_004539.2 /DEF=Homo sapiens asparaginyl-tRNA
synthetase (NARS), mRNA. /FEA=mRNA /GEN=NARS
/PROD=asparaginyl-tRNA synthetase /DB_XREF=gi:7262387
/UG=Hs.181311 asparaginyl-tRNA synthetase
/FL=gb:BC001687.1 gb:D84273.1 gb:NM_004539.2"/>
In this example, the RefSeq accession number "NM_004539.2" is indicated as the source of the sequence data, and the RefSeq accession number "NM_004539" is indicated as being a "Full-Length Reference Sequence." This means that our probe set sequence might be based on information from both records, but the one indicated as the "Source" is the best representative sequence and is the accession number we use in retrieving annotation data from LocusLink.
The following table describes the sorts of data we currently include in <NameValueType> elements inside of <DatabaseEntry> elements. This represents our attempt to include all the annotations on the NetAffx Analysis Center web site in the MAGE-ML files, although some such annotations do not fit well into any more conventional places in MAGE-ML. Future improvements in our MAGE-ML files could involve changes to these annotations.
|Indicates a representative database reference on which the sequence of the Probe Set is based. These annotations are usually accompanied by a <NameValueType> with type "Description."|
value: Full-Length Reference Sequence
|References to accession numbers in the UniGene cluster represented by the probe set in addition to the one that is indicated as the sequence source.|
value: Probe Set ID
|We include one or more references to the database "DB:netaffx" for each probe set. Only one of these will be a reference to the annotation of the probe set itself. (The others, if any, will be HomoloGene references.) The accession number is already included in the <BioSequence> element's identifier attribute, but by also including a <DatabaseEntry>, we can include a complete URL linking to the full record on our web site.|
type: (Species and Array Name)
|HomoloGene references are always references to the database "DB:netaffx". The type attribute indicates which array is being referenced. In those cases where the same probe set exists on multiple arrays with the same name (for example 204780_s_at exists on both HG-Focus and HG-U133A), there can be multiple <NameValueType> tags with value of "HomoloGene" in a single <DatabaseEntry> element.|
value: Protein Families
|This simply indicates which portion of the annotation record generates the data. This is often used with entries from the database "DB:ec". For each record, there can be at most one reference to "DB:ec" that is not indicated as being a "Protein Families" annotation. There can be zero, one, or many references to "DB:ec" that are "Protein Families" annotations.|
value: Protein Similarities
type: BLAST (NR)
|These annotations are usually accompanied by a <NameValueType> with type "Description."|
value: (Free Text)
|Included in many <DatabaseEntry> elements. The format and meaning of the free text in the value attribute varies depending on the value of the <NameValueType> with type "Affy:Annotation_Category" in the same <DatabaseEntry>.|
value: (A Gene Symbol)
|Included in some <DatabaseEntry> elements for the database "DB:locus". The value represents the Gene Symbol, according to the LocusLink entry. Not all probe sets have a gene symbol associated with them, but for those that do, it always comes from a LocusLink reference.|
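To illustrate how the category annotations might be used, the sketch below separates "DB:ec" references into the "Protein Families" ones and the remaining one, following the rule described in the table. It assumes that the category is stored as a <NameValueType> whose name is "Affy:Annotation_Category" and whose value is the category text, and that the database identifier sits in a <Database_ref> inside a <Database_assnref>. The accessions "EC-1" to "EC-3" are hypothetical placeholders, as are the function names.

```python
import xml.etree.ElementTree as ET

# ASSUMED element layout; accessions are placeholders, not real EC numbers.
SAMPLE = """
<SequenceDatabases_assnlist>
  <DatabaseEntry accession="EC-1">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category" value="Protein Families"/>
    </PropertySets_assnlist>
    <Database_assnref><Database_ref identifier="DB:ec"/></Database_assnref>
  </DatabaseEntry>
  <DatabaseEntry accession="EC-2">
    <Database_assnref><Database_ref identifier="DB:ec"/></Database_assnref>
  </DatabaseEntry>
  <DatabaseEntry accession="EC-3">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category" value="Protein Families"/>
    </PropertySets_assnlist>
    <Database_assnref><Database_ref identifier="DB:ec"/></Database_assnref>
  </DatabaseEntry>
</SequenceDatabases_assnlist>
"""

def annotation_category(entry):
    """Return the Affy:Annotation_Category value of an entry, or None."""
    for nvt in entry.findall("PropertySets_assnlist/NameValueType"):
        if nvt.get("name") == "Affy:Annotation_Category":
            return nvt.get("value")
    return None

def split_ec_references(assnlist):
    """Return (other, families) accession lists for DB:ec entries only."""
    other, families = [], []
    for entry in assnlist.findall("DatabaseEntry"):
        db_ref = entry.find("Database_assnref/Database_ref")
        if db_ref is None or db_ref.get("identifier") != "DB:ec":
            continue
        is_family = annotation_category(entry) == "Protein Families"
        (families if is_family else other).append(entry.get("accession"))
    return other, families

other, families = split_ec_references(ET.fromstring(SAMPLE))
print(other, families)
```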
The <OntologyEntries_assnlist> element
The <OntologyEntries_assnlist> contains <OntologyEntry> elements. This is an optional element. Ontology Entries are reserved for annotations that come from a controlled set of vocabulary terms. Since ontology annotations are less common than database references, this element is absent for many Probe Sets.
The following example includes entries for many ontology types. Most Probe Sets have fewer ontology entries than this example, and some have none at all. Note that there can be multiple entries for any category.
description="protein biosynthesis (not recorded)"/>
description="soluble fraction (predicted/computed)"/>
value="Alanine and aspartate metabolism"/>
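The description strings in the fragments above end with a parenthesized note, such as "(not recorded)" or "(predicted/computed)". If that trailing note needs to be handled separately from the term itself, one possible approach is sketched below. Treating the final parenthesized text as a separate qualifier is an interpretation of the examples, not a documented rule, and split_description is a hypothetical helper.

```python
import re

# Matches "<term> (<note>)" where the note is the last parenthesized group
# and contains no nested parentheses.
_TRAILING_NOTE = re.compile(r"^(?P<term>.*\S)\s*\((?P<note>[^()]*)\)\s*$")

def split_description(description):
    """Return (term, note); note is None when there is no trailing '(...)'."""
    match = _TRAILING_NOTE.match(description)
    if match is None:
        return description.strip(), None
    return match.group("term"), match.group("note")

for text in ("protein biosynthesis (not recorded)",
             "soluble fraction (predicted/computed)",
             "Alanine and aspartate metabolism"):
    print(split_description(text))
```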
KEGG pathway annotations are unique in that they can be annotated either as Ontology Entries or Database Entries. Since each method has advantages and disadvantages, we currently make use of both. (Database Entries can hold URLs, but are not designed to include descriptions. Ontology Entries contain descriptions, but not accession numbers or URLs.) The two KEGG pathways for the example Probe Set above are also annotated in the following way:
The <PolymerType_assn> element
This required element has the following invariant form for all Probe Sets in all of our expression arrays:
<OntologyEntry category="polymertype" value="RNA"/>
The <Type_assn> element
This required element has the following form for all Probe Sets in all of our expression arrays:
<OntologyEntry category="biosequence:type" value="mRNA"/>
The value attribute may be "consensus mRNA", "exemplar mRNA", or simply "mRNA".
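A reader that wants to distinguish the three permitted values can map them to a simpler label. The sketch below assumes the <OntologyEntry> is nested directly inside the <Type_assn> element and uses Python's standard xml.etree.ElementTree module; the labels returned and the function name sequence_kind are illustrative choices.

```python
import xml.etree.ElementTree as ET

# The three values listed above, mapped to an illustrative short label.
_KINDS = {
    "mRNA": None,
    "consensus mRNA": "consensus",
    "exemplar mRNA": "exemplar",
}

def sequence_kind(type_assn):
    """Return 'consensus', 'exemplar', or None for a plain 'mRNA' entry."""
    entry = type_assn.find("OntologyEntry[@category='biosequence:type']")
    if entry is None:
        raise ValueError("Type_assn has no biosequence:type OntologyEntry")
    value = entry.get("value")
    if value not in _KINDS:
        raise ValueError(f"unexpected biosequence:type value: {value!r}")
    return _KINDS[value]

# ASSUMED nesting of the OntologyEntry inside Type_assn.
sample = ET.fromstring(
    '<Type_assn><OntologyEntry category="biosequence:type" '
    'value="exemplar mRNA"/></Type_assn>'
)
print(sequence_kind(sample))
```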
The <Species_assn> element
The <Species_assn> element is optional. Since most of our chips are designed with probes for a single species, we usually omit this element to reduce redundancy. When present, it has the following form:
<OntologyEntry category="species" value="Homo sapiens" />
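Because the element is usually omitted, code that needs a species for every Probe Set has to supply a fallback, typically the species the array was designed for. The following sketch assumes the <OntologyEntry> is nested directly inside <Species_assn> and uses Python's standard xml.etree.ElementTree module; species_of and its array_default parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET

def species_of(biosequence, array_default):
    """Return the species of a BioSequence, or array_default when omitted."""
    entry = biosequence.find("Species_assn/OntologyEntry[@category='species']")
    if entry is None:
        return array_default
    return entry.get("value", array_default)

# ASSUMED nesting; identifiers reuse Probe Set IDs mentioned in this document.
with_species = ET.fromstring(
    '<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at">'
    '<Species_assn><OntologyEntry category="species" value="Homo sapiens" />'
    '</Species_assn></BioSequence>'
)
without_species = ET.fromstring(
    '<BioSequence identifier="Affy:Transcript:HG-U133A:200048_s_at"/>'
)

print(species_of(with_species, "unknown"))
print(species_of(without_species, "Homo sapiens"))
```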