Probe Set Data in MAGE-ML Format
We provide annotation data for our GeneChip® catalog arrays in multiple formats. This document describes how we have encoded our array annotations in the MAGE-ML format in files available on the support pages for each array.

Understanding the MAGE-ML Probe Set Files

This document describes how we have encoded annotations for probe sets in our GeneChip catalog arrays in downloadable XML files using the MAGE-ML format. Before reading this document, you may wish to familiarize yourself with more general information about the MAGE-ML format.

Note that to use the XML files, you generally need a copy of the MAGE-ML.dtd file, available at the MAGE-ML support page of the MGED Society (Microarray Gene Expression Data Society). Our XML files assume this DTD file will be located in the same directory as the XML files.

The MAGE-ML format allows a great deal of flexibility in how some types of data can be encoded. To maximize compatibility with files generated by other groups, we follow the recommendations of the European Bioinformatics Institute for the following elements, among others:

  • <BioSequence> identifiers
  • <Database_ref> identifiers
  • <OntologyEntry> categories and values
  • Species names
For annotations that are unique to Affymetrix, or for which standards have not yet been developed, we use only the <NameValueType> XML element, with each name attribute beginning with "Affy:".

Each MAGE-ML annotation file has the following general outline:

  • XML Declarations
  • The <MAGE-ML> element containing:
    • A <PropertySets_assnlist> element
    • A <AuditTrail_assnlist> element
    • A <Description_package> element
    • A <BioSequence_package> element containing:
      • A <BioSequence_assnlist> containing one <BioSequence> element for each Probe Set
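
As an illustration of the outline above, the following minimal Python sketch parses a skeleton document with the standard library's xml.etree.ElementTree module and lists its top-level elements. The skeleton is abbreviated for illustration (the real files contain populated elements), and the variable names are our own.

```python
import xml.etree.ElementTree as ET

# Abbreviated skeleton that follows the outline above; element contents are omitted.
SKELETON = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
<MAGE-ML identifier="Affy:Transcript:HG-U133A">
  <PropertySets_assnlist/>
  <AuditTrail_assnlist/>
  <Description_package/>
  <BioSequence_package>
    <BioSequence_assnlist>
      <BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"/>
    </BioSequence_assnlist>
  </BioSequence_package>
</MAGE-ML>"""

root = ET.fromstring(SKELETON)

# Tags of the direct children of <MAGE-ML>, in document order.
top_level = [child.tag for child in root]

# One <BioSequence> element per Probe Set.
probe_sets = root.findall("BioSequence_package/BioSequence_assnlist/BioSequence")

print(top_level)
print(len(probe_sets))
```

ElementTree is a non-validating parser and does not read the external DTD, so this sketch checks only the shape of the document, not its validity against MAGE-ML.dtd.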

XML Declarations  

The first two lines of each file will always look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
These lines simply identify the content as a MAGE-ML file. They indicate that you will generally need a file called MAGE-ML.dtd which should be placed in the same directory as the MAGE-ML files. See general information about the MAGE-ML format for information on how to download the MAGE-ML.dtd file.
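
Because the DOCTYPE line refers to MAGE-ML.dtd by a relative name, a validating tool will look for it next to the XML file. A minimal sketch of a pre-flight check in Python follows; the function name dtd_is_alongside is hypothetical and not part of any Affymetrix or MGED tool.

```python
from pathlib import Path


def dtd_is_alongside(xml_path, dtd_name="MAGE-ML.dtd"):
    """Return True if the DTD file sits in the same directory as the XML file."""
    # with_name() swaps the final path component, keeping the directory.
    return Path(xml_path).with_name(dtd_name).is_file()
```

Calling dtd_is_alongside with the path of a downloaded annotation file returns False until MAGE-ML.dtd has been copied into that file's directory.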

The <MAGE-ML> element  

The next line will be the <MAGE-ML> element. The identifier attribute identifies which Affymetrix® GeneChip probe array is being annotated. The name of the chip is always prefixed with "Affy:Transcript:". There is always one MAGE-ML file for each probe array, rather than one file for each Chip Set.

<MAGE-ML identifier="Affy:Transcript:HG-U133A">
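
A minimal Python sketch of recovering the chip name from this identifier, assuming only the "Affy:Transcript:" prefix rule described above (the function name array_name is hypothetical):

```python
PREFIX = "Affy:Transcript:"


def array_name(mage_ml_identifier):
    """Strip the fixed prefix from the <MAGE-ML> identifier attribute."""
    if not mage_ml_identifier.startswith(PREFIX):
        raise ValueError("unexpected identifier: " + mage_ml_identifier)
    return mage_ml_identifier[len(PREFIX):]


print(array_name("Affy:Transcript:HG-U133A"))  # HG-U133A
```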

The <PropertySets_assnlist> element  

The <PropertySets_assnlist> is an optional element at this location in MAGE-ML. We use it to hold <NameValueType> elements that describe the probe array in more detail.

<PropertySets_assnlist>
  <NameValueType name="Affy:Chip_Set" 
    value="Affymetrix Human Genome HG-U133 Chip Set"/>
  <NameValueType name="Affy:Chip" 
    value="Affymetrix Human Genome U133A Array"/>
  <NameValueType name="Affy:Chip_Species" 
    value="Homo sapiens"/>
</PropertySets_assnlist>
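
One way to read these <NameValueType> elements is to collect them into a dictionary keyed by the name attribute. The Python sketch below uses xml.etree.ElementTree and the example above; the helper name name_values is our own, and a plain dictionary assumes each name occurs only once in the list.

```python
import xml.etree.ElementTree as ET

PROPERTY_XML = """<PropertySets_assnlist>
  <NameValueType name="Affy:Chip_Set"
    value="Affymetrix Human Genome HG-U133 Chip Set"/>
  <NameValueType name="Affy:Chip"
    value="Affymetrix Human Genome U133A Array"/>
  <NameValueType name="Affy:Chip_Species"
    value="Homo sapiens"/>
</PropertySets_assnlist>"""


def name_values(assnlist):
    """Map each NameValueType name to its value (later duplicates overwrite earlier ones)."""
    return {nvt.get("name"): nvt.get("value")
            for nvt in assnlist.findall("NameValueType")}


props = name_values(ET.fromstring(PROPERTY_XML))
print(props["Affy:Chip_Species"])
```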

The <AuditTrail_assnlist> element  

The <AuditTrail_assnlist> contains an <Audit> element. The date attribute of the <Audit> element describes the date when this set of annotations was generated, in year-month-day format. This is not necessarily the same as the date when the MAGE-ML file itself was generated. Annotations are updated on a quarterly basis. The action attribute is invariably "modification".

<AuditTrail_assnlist>
  <Audit date="2002-10-18" action="modification" />
</AuditTrail_assnlist>
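
Since the date is in year-month-day format, it can be converted to a date object directly. A minimal Python sketch using the example above (variable names are our own):

```python
from datetime import date
import xml.etree.ElementTree as ET

AUDIT_XML = """<AuditTrail_assnlist>
  <Audit date="2002-10-18" action="modification" />
</AuditTrail_assnlist>"""

audit = ET.fromstring(AUDIT_XML).find("Audit")

# date.fromisoformat accepts the YYYY-MM-DD form used in the date attribute.
annotation_date = date.fromisoformat(audit.get("date"))
print(annotation_date.year, audit.get("action"))
```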

The <Description_package> element  

The <Description_package> lists the unique identifiers of all the databases that may be referenced later in the file. (The list is unordered and may include more databases than are actually referenced later in the file.) The following example is much shorter than the list included in any of our annotation files.

<Description_package>
  <Database_assnlist>
    <Database identifier="DB:netaffx" 
      URI="http://www.affymetrix.com/analysis/"/>
    <Database identifier="DB:embl" 
      URI="http://www.ebi.ac.uk/embl/"/>
    <Database identifier="DB:genbank" 
      URI="http://www.ncbi.nlm.nih.gov/Genbank/index.html"/>
    <Database identifier="DB:refseq" 
      URI="http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html"/>
  </Database_assnlist>
</Description_package>
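
Because later <Database_ref> elements refer to these identifiers, it can be convenient to build a lookup table from identifier to URI first. A minimal Python sketch over a shortened version of the example above (variable names are our own):

```python
import xml.etree.ElementTree as ET

DESCRIPTION_XML = """<Description_package>
  <Database_assnlist>
    <Database identifier="DB:netaffx"
      URI="http://www.affymetrix.com/analysis/"/>
    <Database identifier="DB:embl"
      URI="http://www.ebi.ac.uk/embl/"/>
  </Database_assnlist>
</Description_package>"""

package = ET.fromstring(DESCRIPTION_XML)

# identifier -> URI for every declared database; order is not significant.
databases = {db.get("identifier"): db.get("URI")
             for db in package.findall("Database_assnlist/Database")}
print(databases["DB:embl"])
```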

The <BioSequence_package> element  

The <BioSequence_package> element contains the bulk of the annotation data in each MAGE-ML file. It consists of a single <BioSequence_assnlist> element, which contains a series of <BioSequence> elements. There is one <BioSequence> for each probe set on the probe array, and these may occur in any order.

The <BioSequence> element  

Each <BioSequence> tag can carry three attributes: identifier, name, and sequence. Only identifier is required.

  • The identifier attribute is required. It will take the form identifier="Affy:Transcript:HG-U133A:1053_at". In this example, "Affy:Transcript:HG-U133A" identifies the array, and will always match the identifier attribute of the enclosing <MAGE-ML> element, and "1053_at" is the Probe Set ID.
  • The name attribute is optional. When present, it contains the Title of the Probe Set. Not every Probe Set has a Title.
  • The sequence attribute is optional. When present, it contains the sequence from which individual probe sequences were chosen. (The sequence can be an "Exemplar," "Consensus," or "Control" sequence, described in a <NameValueType> below.) At present we include sequence data with each probe set, but because this information can be transmitted more compactly in other file formats, we may remove the sequence data in the future.
<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at" 
  name="replication factor C (activator 1) 2, 40kDa" 
  sequence="caatttgagtttccat....">
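
The identifier can be split back into its array part and its Probe Set ID. The Python sketch below splits at the last colon, which assumes that a Probe Set ID never contains a colon itself; the function name is hypothetical.

```python
def split_probe_set_identifier(identifier):
    """Split a <BioSequence> identifier into (array identifier, Probe Set ID)."""
    # rpartition splits at the LAST colon, so the array prefix keeps its own colons.
    array_identifier, _, probe_set_id = identifier.rpartition(":")
    return array_identifier, probe_set_id


array_id, probe_set_id = split_probe_set_identifier(
    "Affy:Transcript:HG-U133A:1053_at")
print(array_id, probe_set_id)
```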

Each <BioSequence> element contains the following elements:
  • A <PropertySets_assnlist>
  • An optional <SequenceDatabases_assnlist>
  • An optional <OntologyEntries_assnlist>
  • A <PolymerType_assn>
  • A <Type_assn>
  • An optional <Species_assn>

The <PropertySets_assnlist> element  

The <PropertySets_assnlist> contains mostly Affymetrix-specific annotations. MAGE-ML treats this element as optional, but it is always present in our <BioSequence> elements.

The following example (from "Affy:Transcript:HG-U133A:200048_s_at") illustrates the possible types of annotations that may be included in this list. Not every type of annotation is present for every Probe Set.

<PropertySets_assnlist>
  <NameValueType name="Affy:Sequence_ID" value="g5729888"/>
  <NameValueType name="Affy:Transcript_ID" value="Hs.6396.0"/>
  <NameValueType name="Affy:Sequence_Type" value="Exemplar"/>
  <NameValueType name="Affy:Group_ID" 
    value="1500074354 (3 transcripts, 2 gene clusters)"/>
  <NameValueType name="Affy:Chromosomal_Location" value="1q21"/>
</PropertySets_assnlist>

The <SequenceDatabases_assnlist> element  

The <SequenceDatabases_assnlist> contains database references in <DatabaseEntry> elements. Although optional, the <SequenceDatabases_assnlist> element is rarely absent.

The <DatabaseEntry> element  

Each <DatabaseEntry> element contains a single reference to an entry in a database. In its simplest form, it contains a database identifier, an accession number, and an optional URI (or URL).

<DatabaseEntry accession="d1b8aa2" 
  URI="http://scop.berkeley.edu/search.cgi?key=d1b8aa2">
  <Database_assnref>
    <Database_ref identifier="DB:scop"/>
  </Database_assnref>
</DatabaseEntry>
In general, we will not include URLs that point to web sites other than those owned by Affymetrix. Instead, instructions on linking to other databases are provided in the manual "Probe Set Data in Tabular Format".
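
A minimal Python sketch of reading the simplest form of a <DatabaseEntry>, using the SCOP example above. The helper name read_database_entry and the dictionary keys are our own; the URI is returned as None when the attribute is absent.

```python
import xml.etree.ElementTree as ET

ENTRY_XML = """<DatabaseEntry accession="d1b8aa2"
  URI="http://scop.berkeley.edu/search.cgi?key=d1b8aa2">
  <Database_assnref>
    <Database_ref identifier="DB:scop"/>
  </Database_assnref>
</DatabaseEntry>"""


def read_database_entry(entry):
    """Return the database identifier, accession, and optional URI of one entry."""
    ref = entry.find("Database_assnref/Database_ref")
    return {
        "database": ref.get("identifier") if ref is not None else None,
        "accession": entry.get("accession"),
        "uri": entry.get("URI"),  # None when the entry carries no URI
    }


record = read_database_entry(ET.fromstring(ENTRY_XML))
print(record["database"], record["accession"])
```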

We often need to provide additional information in <DatabaseEntry> elements. This is done through the use of a <PropertySets_assnlist>. The most common additional information we add is a Description string. But we also often indicate which type of annotation is being described by providing a value for Affy:Annotation_Category. For example, there may be multiple references to the RefSeq database in a probe set. But they may mean very different things:

<DatabaseEntry accession="NM_004539">
  <PropertySets_assnlist>
    <NameValueType name="Affy:Annotation_Category" 
    value="Full-Length Reference Sequence"/>
  </PropertySets_assnlist>
  <Database_assnref>
    <Database_ref identifier="DB:refseq"/>
  </Database_assnref>
</DatabaseEntry>
<DatabaseEntry accession="NM_004539.2">
  <PropertySets_assnlist>
    <NameValueType name="Affy:Annotation_Category" 
      value="Source"/>
    <NameValueType name="Description" 
      value="gb:NM_004539.2 /DEF=Homo sapiens asparaginyl-tRNA 
      synthetase (NARS), mRNA. /FEA=mRNA /GEN=NARS 
      /PROD=asparaginyl-tRNA synthetase /DB_XREF=gi:7262387 
      /UG=Hs.181311 asparaginyl-tRNA synthetase 
      /FL=gb:BC001687.1 gb:D84273.1 gb:NM_004539.2"/>
  </PropertySets_assnlist>
  <Database_assnref>
    <Database_ref identifier="DB:refseq"/>
  </Database_assnref>
</DatabaseEntry>
In this example, the RefSeq accession number "NM_004539.2" is indicated as the source of the sequence data, and the RefSeq accession number "NM_004539" is indicated as being a "Full-Length Reference Sequence." This means that our probe set sequence might be based on information from both records, but the one indicated as the "Source" is the best representative sequence and is the accession number we use in retrieving annotation data from LocusLink.
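
To tell such references apart in code, the entries for one database can be grouped by their Affy:Annotation_Category value. The Python sketch below does this for the two RefSeq entries above; the Description text is abbreviated, the enclosing <SequenceDatabases_assnlist> is added so the fragment parses, and the helper names are our own.

```python
import xml.etree.ElementTree as ET

ENTRIES_XML = """<SequenceDatabases_assnlist>
  <DatabaseEntry accession="NM_004539">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category"
        value="Full-Length Reference Sequence"/>
    </PropertySets_assnlist>
    <Database_assnref><Database_ref identifier="DB:refseq"/></Database_assnref>
  </DatabaseEntry>
  <DatabaseEntry accession="NM_004539.2">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category" value="Source"/>
      <NameValueType name="Description" value="(abbreviated)"/>
    </PropertySets_assnlist>
    <Database_assnref><Database_ref identifier="DB:refseq"/></Database_assnref>
  </DatabaseEntry>
</SequenceDatabases_assnlist>"""


def category_of(entry):
    """Return the Affy:Annotation_Category value of an entry, or None."""
    for nvt in entry.findall("PropertySets_assnlist/NameValueType"):
        if nvt.get("name") == "Affy:Annotation_Category":
            return nvt.get("value")
    return None


def accessions_by_category(assnlist, database):
    """Group the accessions for one database by annotation category."""
    grouped = {}
    for entry in assnlist.findall("DatabaseEntry"):
        ref = entry.find("Database_assnref/Database_ref")
        if ref is not None and ref.get("identifier") == database:
            grouped.setdefault(category_of(entry), []).append(entry.get("accession"))
    return grouped


refseq = accessions_by_category(ET.fromstring(ENTRIES_XML), "DB:refseq")
print(refseq["Source"])
```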

The following table describes the sorts of data we currently include in <NameValueType> elements inside of <DatabaseEntry> elements. This represents our attempt to include all the annotations on the NetAffx™ Analysis Center web site in the MAGE-ML files, although some such annotations do not fit well into any more conventional places in MAGE-ML. Future improvements in our MAGE-ML files could involve changes to these annotations.
Each entry below lists the attributes of the <NameValueType> element, followed by notes on its use.

  • Attributes: name: Affy:Annotation_Category; value: Source
    Notes: Indicates a representative database reference on which the sequence of the Probe Set was based. These annotations are usually accompanied by a <NameValueType> with name "Description."
  • Attributes: name: Affy:Annotation_Category; value: Full-Length Reference Sequence
    Notes: References to accession numbers in the UniGene cluster represented by the probe set, in addition to the one that is indicated as the sequence source.
  • Attributes: name: Affy:Annotation_Category; value: Probe Set ID
    Notes: We include one or more references to the database "DB:netaffx" for each probe set. Only one of these will be a reference to the annotation of the probe set itself. (The others, if any, will be HomoloGene references.) The accession number is already included in the <BioSequence> element's identifier attribute, but by also including a <DatabaseEntry>, we can include a complete URL linking to the full record on our web site.
  • Attributes: name: Affy:Annotation_Category; value: HomoloGene; type: (Species and Array Name)
    Notes: HomoloGene references are always references to the database "DB:netaffx". The type attribute indicates which array is being referenced. Where a probe set with the same name exists on multiple arrays (for example, 204780_s_at exists on both HG-Focus and HG-U133A), there can be multiple <NameValueType> tags with a value of "HomoloGene" in a single <DatabaseEntry> element.
  • Attributes: name: Affy:Annotation_Category; value: Protein Families
    Notes: This simply indicates which portion of the annotation record generates the data. This is often used with entries from the database "DB:ec". For each record, there can be at most one reference to "DB:ec" that is not indicated as being a "Protein Families" annotation. There can be zero, one, or many references to "DB:ec" that are "Protein Families" annotations.
  • Attributes: name: Affy:Annotation_Category; value: Protein Similarities; type: BLAST (NR)
    Notes: These annotations are usually accompanied by a <NameValueType> with name "Description."
  • Attributes: name: Description; value: (Free Text)
    Notes: Included in many <DatabaseEntry> elements. The format and meaning of the free text in the value attribute varies depending on the value of the <NameValueType> with name "Affy:Annotation_Category" in the same <DatabaseEntry>.
  • Attributes: name: GeneSymbol; value: (A Gene Symbol)
    Notes: Included in some <DatabaseEntry> elements for the database "DB:locus". The value represents the Gene Symbol, according to the LocusLink entry. Not all probe sets have a gene symbol associated with them, but for those that do, it always comes from a LocusLink reference.

The <OntologyEntries_assnlist> element  

The <OntologyEntries_assnlist> contains <OntologyEntry> elements. This is an optional element. Ontology Entries are reserved for annotations that come from a controlled set of vocabulary terms. Since ontology annotations are less common than database references, this element is absent for many Probe Sets.

The following example includes entries for many ontology types. Most Probe Sets have fewer ontology entries than this example, and some have none at all. Note that there can be multiple entries for any category.

<OntologyEntries_assnlist>
  <OntologyEntry category="GO:Biological_Process" 
    value="GO:6412" 
    description="protein biosynthesis (not recorded)"/>
  <OntologyEntry category="GO:Cellular_Component" 
    value="GO:5625" 
    description="soluble fraction (predicted/computed)"/>
  <OntologyEntry category="GO:Cellular_Component" 
    value="GO:5737" 
    description="cytoplasm (predicted/computed)"/>
  <OntologyEntry category="GO:Molecular_Function" 
    value="GO:4816" 
    description="asparagine--tRNA ligase 
    (experimental evidence)"/>
  <OntologyEntry category="Pathway:GenMapp" 
    value="Alanine-Aspartate Metabolism"/>
  <OntologyEntry category="Pathway:GenMapp" 
    value="Protein Folding-Secretion"/>
  <OntologyEntry category="Pathway:KEGG" 
    value="Alanine and aspartate metabolism"/>
  <OntologyEntry category="Pathway:KEGG" 
    value="Aminoacyl-tRNA biosynthesis"/>
</OntologyEntries_assnlist>
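
Since a category can occur more than once, a reader will usually want a list of values per category rather than a single value. A minimal Python sketch over a subset of the example above (variable names are our own):

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

ONTOLOGY_XML = """<OntologyEntries_assnlist>
  <OntologyEntry category="GO:Cellular_Component"
    value="GO:5625"
    description="soluble fraction (predicted/computed)"/>
  <OntologyEntry category="GO:Cellular_Component"
    value="GO:5737"
    description="cytoplasm (predicted/computed)"/>
  <OntologyEntry category="Pathway:KEGG"
    value="Alanine and aspartate metabolism"/>
  <OntologyEntry category="Pathway:KEGG"
    value="Aminoacyl-tRNA biosynthesis"/>
</OntologyEntries_assnlist>"""

# category -> list of values, preserving document order within each category.
by_category = defaultdict(list)
for entry in ET.fromstring(ONTOLOGY_XML).findall("OntologyEntry"):
    by_category[entry.get("category")].append(entry.get("value"))

print(by_category["GO:Cellular_Component"])
```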

KEGG pathway annotations are unique in that they can be annotated either as Ontology Entries or Database Entries. Since each method has advantages and disadvantages, we currently make use of both. (Database Entries can hold URLs, but are not designed to include descriptions. Ontology Entries contain descriptions, but not accession numbers or URLs.) The two KEGG pathways for the example Probe Set above are also annotated in the following way:

<DatabaseEntry accession="MAP00252">
  <Database_assnref>
    <Database_ref identifier="DB:kegg"/>
  </Database_assnref>
</DatabaseEntry>
<DatabaseEntry accession="MAP00970">
  <Database_assnref>
    <Database_ref identifier="DB:kegg"/>
  </Database_assnref>
</DatabaseEntry>

The <PolymerType_assn> element  

This required element has the following invariant form for all Probe Sets in all of our expression arrays:

<PolymerType_assn>
  <OntologyEntry category="polymertype" value="RNA"/>
</PolymerType_assn>

The <Type_assn> element  

This required element has the following form for all Probe Sets in all of our expression arrays:

<Type_assn>
  <OntologyEntry category="biosequence:type" value="mRNA"/>
</Type_assn>
The value attribute may be "consensus mRNA", "exemplar mRNA", or simply "mRNA".

The <Species_assn> element  

The <Species_assn> element is optional. Since most of our chips are designed with probes for a single species, we usually omit this element to reduce redundancy. When present, it has the following form:

<Species_assn>
  <OntologyEntry category="species" value="Homo sapiens" />
</Species_assn>
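
When the element is omitted, a reader needs some other source for the species. The Python sketch below falls back to a caller-supplied value; using the array-level Affy:Chip_Species property for that fallback is our assumption, not something the files state explicitly, and the function name species_of is hypothetical.

```python
import xml.etree.ElementTree as ET


def species_of(bio_sequence, chip_species):
    """Return the species from <Species_assn>, or chip_species if it is absent."""
    entry = bio_sequence.find("Species_assn/OntologyEntry")
    if entry is not None and entry.get("category") == "species":
        return entry.get("value")
    # Assumed fallback: the array-level Affy:Chip_Species value.
    return chip_species


with_species = ET.fromstring(
    '<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at">'
    '<Species_assn>'
    '<OntologyEntry category="species" value="Homo sapiens" />'
    '</Species_assn>'
    '</BioSequence>')
without_species = ET.fromstring(
    '<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"/>')

print(species_of(with_species, "Homo sapiens"))
print(species_of(without_species, "Homo sapiens"))
```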
