This document describes how we encode annotations for the probe sets on
our GeneChip catalog arrays in
downloadable XML files using the MAGE-ML format. Before reading this
document, you may wish to familiarize yourself with more general
information about the MAGE-ML format.
Note that to use the XML files, you generally need a copy of the
MAGE-ML.dtd file, available at the
MAGE-ML
support page of the MGED Society
(Microarray Gene Expression Data Society).
Our XML files assume this DTD file will be located in
the same directory as the XML files.
The MAGE-ML format allows a great deal of flexibility in how some types of
data can be encoded. In order to maximize compatibility with files generated
by other groups, we have chosen to follow the recommendations of the
European Bioinformatics Institute with respect to these, and other,
elements:
- <BioSequence> identifiers
- <Database_ref> identifiers
- <OntologyEntry> categories and values
- Species names
For annotations that are unique to Affymetrix, or for which standards have not
yet been developed, we use only the <NameValueType> XML element, with
each name attribute beginning with "Affy:".
Each MAGE-ML annotation file has the following
general outline:
- XML Declarations
- The <MAGE-ML> element containing:
  - A <PropertySets_assnlist> element
  - An <AuditTrail_assnlist> element
  - A <Description_package> element
  - A <BioSequence_package> element containing:
    - A <BioSequence_assnlist> containing one <BioSequence> element for each Probe Set
The first two lines of each file will always look like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
These lines simply identify the content as a MAGE-ML file. They indicate that
you will generally need a file called MAGE-ML.dtd which should be
placed in the same directory as the MAGE-ML files.
See the general
information about the MAGE-ML format for instructions on downloading the MAGE-ML.dtd file.
The next line will be the <MAGE-ML> element. The identifier
attribute identifies which Affymetrix® GeneChip probe array is being
annotated. The name of the chip is always prefixed with "Affy:Transcript:".
There is always one MAGE-ML file for each probe array, rather than one
file for each Chip Set.
<MAGE-ML identifier="Affy:Transcript:HG-U133A">
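The header lines and the <MAGE-ML> element described above can be read with a few lines of Python. This is only a sketch: it assumes the standard-library xml.etree.ElementTree module, which is a non-validating parser and does not read the external MAGE-ML.dtd (validating parsers will still look for the DTD next to the XML file). The sample document and the names SAMPLE, PREFIX, and array_name are illustrative and not part of our files.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for an annotation file; real files contain far more content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
<MAGE-ML identifier="Affy:Transcript:HG-U133A">
</MAGE-ML>
"""

PREFIX = "Affy:Transcript:"

root = ET.fromstring(SAMPLE)
identifier = root.get("identifier")

# Strip the fixed prefix to recover the probe array name.
array_name = identifier[len(PREFIX):] if identifier.startswith(PREFIX) else identifier

print(root.tag)     # MAGE-ML
print(array_name)   # HG-U133A
```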
The <PropertySets_assnlist> is an optional element at this location
in MAGE-ML. We use it to hold <NameValueType> elements that describe
the probe array in more detail.
<PropertySets_assnlist>
<NameValueType name="Affy:Chip_Set"
value="Affymetrix Human Genome HG-U133 Chip Set"/>
<NameValueType name="Affy:Chip"
value="Affymetrix Human Genome U133A Array"/>
<NameValueType name="Affy:Chip_Species"
value="Homo sapiens"/>
</PropertySets_assnlist>
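Because every entry in this list is a <NameValueType> with a name and a value, the list maps naturally onto a dictionary. The following Python sketch shows one way to do that, assuming the standard xml.etree.ElementTree module; the helper name name_values is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<PropertySets_assnlist>
  <NameValueType name="Affy:Chip_Set"
                 value="Affymetrix Human Genome HG-U133 Chip Set"/>
  <NameValueType name="Affy:Chip"
                 value="Affymetrix Human Genome U133A Array"/>
  <NameValueType name="Affy:Chip_Species"
                 value="Homo sapiens"/>
</PropertySets_assnlist>
"""

def name_values(assnlist):
    """Return {name: value} for the <NameValueType> children of an element."""
    return {nvt.get("name"): nvt.get("value")
            for nvt in assnlist.findall("NameValueType")}

chip_properties = name_values(ET.fromstring(SAMPLE))
print(chip_properties["Affy:Chip_Species"])  # Homo sapiens
```

A plain dictionary keeps only one value per name, so a list of (name, value) pairs is the safer choice wherever the same name can occur more than once in one list.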
The <AuditTrail_assnlist> contains an <Audit> element.
The date attribute of the
<Audit> element gives the date on which this set of annotations
was generated, in year-month-day format.
This is not necessarily the same as the date when the MAGE-ML file
itself was generated. Annotations are updated on a quarterly basis.
The action attribute is invariably "modification".
<AuditTrail_assnlist>
<Audit date="2002-10-18" action="modification" />
</AuditTrail_assnlist>
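The year-month-day date can be converted to a date object for comparisons between annotation releases. A minimal Python sketch, assuming the standard datetime and xml.etree.ElementTree modules (the variable names are illustrative):

```python
import datetime
import xml.etree.ElementTree as ET

SAMPLE = """
<AuditTrail_assnlist>
  <Audit date="2002-10-18" action="modification" />
</AuditTrail_assnlist>
"""

audit = ET.fromstring(SAMPLE).find("Audit")

# An explicit format string documents the expected year-month-day layout.
annotation_date = datetime.datetime.strptime(audit.get("date"), "%Y-%m-%d").date()

print(annotation_date.year, annotation_date.month, annotation_date.day)  # 2002 10 18
```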
The <Description_package> lists the unique identifiers of
all the databases that may be referenced later in the file.
(The list is unordered and may include more databases than
are actually referenced later in the file.)
The following example is much shorter than the list included in any of
our annotation files.
<Description_package>
<Database_assnlist>
<Database identifier="DB:netaffx"
URI="http://www.affymetrix.com/analysis/"/>
<Database identifier="DB:embl"
URI="http://www.ebi.ac.uk/embl/"/>
<Database identifier="DB:genbank"
URI="http://www.ncbi.nlm.nih.gov/Genbank/index.html"/>
<Database identifier="DB:refseq"
URI="http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html"/>
</Database_assnlist>
</Description_package>
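Since later references use only the database identifier, it can be handy to load this list into a lookup table first. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a shortened sample; the names databases and is_declared are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Description_package>
  <Database_assnlist>
    <Database identifier="DB:netaffx"
              URI="http://www.affymetrix.com/analysis/"/>
    <Database identifier="DB:embl"
              URI="http://www.ebi.ac.uk/embl/"/>
  </Database_assnlist>
</Description_package>
"""

package = ET.fromstring(SAMPLE)

# Map each database identifier to its URI.
databases = {db.get("identifier"): db.get("URI")
             for db in package.findall("Database_assnlist/Database")}

def is_declared(database_id):
    """True if a <Database_ref> identifier appears in the declared list."""
    return database_id in databases

print(databases["DB:embl"])       # http://www.ebi.ac.uk/embl/
print(is_declared("DB:netaffx"))  # True
```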
The <BioSequence_package> element contains the bulk of the annotation
data in each MAGE-ML file. It consists of a single <BioSequence_assnlist>
element, which contains a series of <BioSequence> elements. There
is one <BioSequence> for each probe set on the probe array, and
these may occur in any order.
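Because the <BioSequence> elements may occur in any order, code that looks up individual probe sets will usually want to index them by identifier rather than rely on position. A Python sketch of this, assuming the standard xml.etree.ElementTree module and the element names from the outline above; the sample is stripped down (real <BioSequence> elements have child elements) and by_id is an illustrative name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<MAGE-ML identifier="Affy:Transcript:HG-U133A">
  <BioSequence_package>
    <BioSequence_assnlist>
      <BioSequence identifier="Affy:Transcript:HG-U133A:200048_s_at"/>
      <BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"
                   name="replication factor C (activator 1) 2, 40kDa"/>
    </BioSequence_assnlist>
  </BioSequence_package>
</MAGE-ML>
"""

root = ET.fromstring(SAMPLE)

# Index every probe set by its identifier; document order is not significant.
by_id = {bs.get("identifier"): bs
         for bs in root.findall("BioSequence_package/BioSequence_assnlist/BioSequence")}

print(len(by_id))  # 2
print(by_id["Affy:Transcript:HG-U133A:1053_at"].get("name"))
```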
Each <BioSequence> tag can have three attributes: identifier,
name, and sequence.
- The identifier attribute is required. It will take the form
identifier="Affy:Transcript:HG-U133A:1053_at". In this example, "Affy:Transcript:HG-U133A"
identifies the array, and will always match the identifier
attribute of the enclosing <MAGE-ML> element, and "1053_at" is
the Probe Set ID.
- The name attribute is optional. When present, it contains
the Title of the Probe Set. Not every Probe Set has a Title.
- The sequence attribute is optional. When present, it contains
the sequence from which individual probe sequences were chosen. (The
sequence can be an "Exemplar," "Consensus," or "Control" sequence, described
in a <NameValueType> below.) At present we include sequence data
with each probe set, but since this information can be transmitted more
compactly in other file formats, we may remove the sequence data in
the future.
<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"
name="replication factor C (activator 1) 2, 40kDa"
sequence="caatttgagtttccat....">
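The identifier can be split into its array part and its Probe Set ID, and the two optional attributes can be read with a fallback. The Python sketch below assumes the standard xml.etree.ElementTree module and assumes that a Probe Set ID never contains a colon; the helper name describe is hypothetical, and the sample omits the sequence attribute.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at"
             name="replication factor C (activator 1) 2, 40kDa"/>
"""

def describe(biosequence):
    """Split the identifier and read the optional attributes."""
    # Assumption: the Probe Set ID itself contains no colon,
    # so the last colon separates it from the array identifier.
    array_id, _, probe_set_id = biosequence.get("identifier").rpartition(":")
    return {
        "array": array_id,
        "probe_set_id": probe_set_id,
        "title": biosequence.get("name"),         # None when absent
        "sequence": biosequence.get("sequence"),  # None when absent
    }

info = describe(ET.fromstring(SAMPLE))
print(info["array"])         # Affy:Transcript:HG-U133A
print(info["probe_set_id"])  # 1053_at
```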
Each <BioSequence> element contains the following elements:
- A <PropertySets_assnlist>
- An optional <SequenceDatabases_assnlist>
- An optional <OntologyEntries_assnlist>
- A <PolymerType_assn>
- A <Type_assn>
- An optional <Species_assn>
The <PropertySets_assnlist> contains mostly Affymetrix-specific annotations.
This is optional from the MAGE-ML standpoint, but is always present in our
<BioSequence> elements.
The following example (from "Affy:Transcript:HG-U133A:200048_s_at")
illustrates the possible types of annotations that may
be included in this list. Not every type of annotation is present for every Probe Set.
<PropertySets_assnlist>
<NameValueType name="Affy:Sequence_ID" value="g5729888"/>
<NameValueType name="Affy:Transcript_ID" value="Hs.6396.0"/>
<NameValueType name="Affy:Sequence_Type" value="Exemplar"/>
<NameValueType name="Affy:Group_ID"
value="1500074354 (3 transcripts, 2 gene clusters)"/>
<NameValueType name="Affy:Chromosomal_Location" value="1q21"/>
</PropertySets_assnlist>
The <SequenceDatabases_assnlist> contains database references
in <DatabaseEntry> elements. Although optional, the <SequenceDatabases_assnlist>
element is rarely absent.
Each <DatabaseEntry> element contains a single reference to an
entry in a database. In its simplest form, it contains a database identifier,
an accession number, and an optional URI (or URL).
<DatabaseEntry accession="d1b8aa2"
URI="http://scop.berkeley.edu/search.cgi?key=d1b8aa2">
<Database_assnref>
<Database_ref identifier="DB:scop"/>
</Database_assnref>
</DatabaseEntry>
In general, we will not include URLs that point to web sites other
than those owned by Affymetrix. Instead, instructions on linking to other
databases are provided in the manual "Probe
Set Data in Tabular Format".
We often need to provide additional information in <DatabaseEntry> elements.
This is done through the use of a <PropertySets_assnlist>. The most common
additional information is a Description string, but we also often
indicate which type of annotation is being described by providing a
value for Affy:Annotation_Category. For example, a probe set may contain multiple references
to the RefSeq database that mean very different things:
<DatabaseEntry accession="NM_004539">
<PropertySets_assnlist>
<NameValueType name="Affy:Annotation_Category"
value="Full-Length Reference Sequence"/>
</PropertySets_assnlist>
<Database_assnref>
<Database_ref identifier="DB:refseq"/>
</Database_assnref>
</DatabaseEntry>
<DatabaseEntry accession="NM_004539.2">
<PropertySets_assnlist>
<NameValueType name="Affy:Annotation_Category"
value="Source"/>
<NameValueType name="Description"
value="gb:NM_004539.2 /DEF=Homo sapiens asparaginyl-tRNA
synthetase (NARS), mRNA. /FEA=mRNA /GEN=NARS
/PROD=asparaginyl-tRNA synthetase /DB_XREF=gi:7262387
/UG=Hs.181311 asparaginyl-tRNA synthetase
/FL=gb:BC001687.1 gb:D84273.1 gb:NM_004539.2"/>
</PropertySets_assnlist>
<Database_assnref>
<Database_ref identifier="DB:refseq"/>
</Database_assnref>
</DatabaseEntry>
In this example, the RefSeq accession number "NM_004539.2" is indicated
as the source of the sequence data, and the RefSeq accession number "NM_004539"
is indicated as being a "Full-Length Reference Sequence." This means that
our probe set sequence might be based on information from both records,
but the one indicated as the "Source" is the best representative sequence
and is the accession number we use in retrieving annotation data from LocusLink.
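To tell such references apart in code, each <DatabaseEntry> can be paired with its Affy:Annotation_Category values. The following Python sketch assumes the standard xml.etree.ElementTree module; the helper name entries_by_category is hypothetical, and the sample omits the Description string for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<SequenceDatabases_assnlist>
  <DatabaseEntry accession="NM_004539">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category"
                     value="Full-Length Reference Sequence"/>
    </PropertySets_assnlist>
    <Database_assnref>
      <Database_ref identifier="DB:refseq"/>
    </Database_assnref>
  </DatabaseEntry>
  <DatabaseEntry accession="NM_004539.2">
    <PropertySets_assnlist>
      <NameValueType name="Affy:Annotation_Category" value="Source"/>
    </PropertySets_assnlist>
    <Database_assnref>
      <Database_ref identifier="DB:refseq"/>
    </Database_assnref>
  </DatabaseEntry>
</SequenceDatabases_assnlist>
"""

def entries_by_category(assnlist, database_id):
    """Return [(accession, [categories])] for entries of one database."""
    result = []
    for entry in assnlist.findall("DatabaseEntry"):
        ref = entry.find("Database_assnref/Database_ref")
        if ref is None or ref.get("identifier") != database_id:
            continue
        categories = [nvt.get("value")
                      for nvt in entry.findall("PropertySets_assnlist/NameValueType")
                      if nvt.get("name") == "Affy:Annotation_Category"]
        result.append((entry.get("accession"), categories))
    return result

refseq = entries_by_category(ET.fromstring(SAMPLE), "DB:refseq")
source = [accession for accession, categories in refseq if "Source" in categories]
print(source)  # ['NM_004539.2']
```

Categories are collected as a list because a single <DatabaseEntry> may carry more than one Affy:Annotation_Category value.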
The following table describes the sorts of data we currently include
in <NameValueType> elements inside of <DatabaseEntry> elements.
This represents our attempt to include all the annotations on the NetAffx
Analysis Center web site in the MAGE-ML files, although some such annotations
do not fit well into any more conventional places in MAGE-ML. Future improvements
in our MAGE-ML files could involve changes to these annotations.
| <NameValueType> attributes | Description |
| --- | --- |
| name: Affy:Annotation_Category; value: Source | Indicates a representative database reference on which the sequence of the Probe Set is based. These annotations are usually accompanied by a <NameValueType> with name "Description". |
| name: Affy:Annotation_Category; value: Full-Length Reference Sequence | References to accession numbers in the UniGene cluster represented by the probe set, in addition to the one that is indicated as the sequence source. |
| name: Affy:Annotation_Category; value: Probe Set ID | We include one or more references to the database "DB:netaffx" for each probe set. Only one of these will be a reference to the annotation of the probe set itself. (The others, if any, will be HomoloGene references.) The accession number is already included in the <BioSequence> element's identifier attribute, but by also including a <DatabaseEntry>, we can include a complete URL linking to the full record on our web site. |
| name: Affy:Annotation_Category; value: HomoloGene; type: (Species and Array Name) | HomoloGene references are always references to the database "DB:netaffx". The type attribute indicates which array is being referenced. In those cases where a probe set with the same name exists on multiple arrays (for example, 204780_s_at exists on both HG-Focus and HG-U133A), there can be multiple <NameValueType> tags with a value of "HomoloGene" in a single <DatabaseEntry> element. |
| name: Affy:Annotation_Category; value: Protein Families | This simply indicates which portion of the annotation record generates the data. This is often used with entries from the database "DB:ec". For each record, there can be at most one reference to "DB:ec" that is not indicated as being a "Protein Families" annotation. There can be zero, one, or many references to "DB:ec" that are "Protein Families" annotations. |
| name: Affy:Annotation_Category; value: Protein Similarities; type: BLAST (NR) | These annotations are usually accompanied by a <NameValueType> with name "Description". |
| name: Description; value: (Free Text) | Included in many <DatabaseEntry> elements. The format and meaning of the free text in the value attribute vary depending on the value of the <NameValueType> with name "Affy:Annotation_Category" in the same <DatabaseEntry>. |
| name: GeneSymbol; value: (A Gene Symbol) | Included in some <DatabaseEntry> elements for the database "DB:locus". The value represents the Gene Symbol, according to the LocusLink entry. Not all probe sets have a gene symbol associated with them, but for those that do, it always comes from a LocusLink reference. |
The <OntologyEntries_assnlist> contains <OntologyEntry>
elements. This is an optional element. Ontology Entries are reserved for
annotations that come from a controlled set of vocabulary terms. Since
ontology annotations are less common than database references, this element
is absent for many Probe Sets.
The following example includes entries for many ontology types.
Most Probe Sets have fewer ontology entries than this example,
and some have none at all. Note that there
can be multiple entries for any category.
<OntologyEntries_assnlist>
<OntologyEntry category="GO:Biological_Process"
value="GO:6412"
description="protein biosynthesis (not recorded)"/>
<OntologyEntry category="GO:Cellular_Component"
value="GO:5625"
description="soluble fraction (predicted/computed)"/>
<OntologyEntry category="GO:Cellular_Component"
value="GO:5737"
description="cytoplasm (predicted/computed)"/>
<OntologyEntry category="GO:Molecular_Function"
value="GO:4816"
description="asparagine--tRNA ligase
(experimental evidence)"/>
<OntologyEntry category="Pathway:GenMapp"
value="Alanine-Aspartate Metabolism"/>
<OntologyEntry category="Pathway:GenMapp"
value="Protein Folding-Secretion"/>
<OntologyEntry category="Pathway:KEGG"
value="Alanine and aspartate metabolism"/>
<OntologyEntry category="Pathway:KEGG"
value="Aminoacyl-tRNA biosynthesis"/>
</OntologyEntries_assnlist>
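Since a category can appear more than once, ontology entries are best collected into a list per category. A Python sketch of this grouping, assuming the standard xml.etree.ElementTree and collections modules and a shortened sample (the name terms is illustrative):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """
<OntologyEntries_assnlist>
  <OntologyEntry category="GO:Cellular_Component"
                 value="GO:5625"
                 description="soluble fraction (predicted/computed)"/>
  <OntologyEntry category="GO:Cellular_Component"
                 value="GO:5737"
                 description="cytoplasm (predicted/computed)"/>
  <OntologyEntry category="Pathway:KEGG"
                 value="Alanine and aspartate metabolism"/>
</OntologyEntries_assnlist>
"""

# category -> list of (value, description); description is None when absent.
terms = defaultdict(list)
for entry in ET.fromstring(SAMPLE).findall("OntologyEntry"):
    terms[entry.get("category")].append((entry.get("value"), entry.get("description")))

print(len(terms["GO:Cellular_Component"]))  # 2
print(terms["Pathway:KEGG"][0][0])          # Alanine and aspartate metabolism
```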
KEGG pathway annotations are unique in that they can be annotated either
as Ontology Entries or Database Entries. Since each method has advantages
and disadvantages, we currently make use of both. (Database Entries can
hold URLs, but are not designed to include descriptions. Ontology Entries
contain descriptions, but not accession numbers or URLs.) The two KEGG
pathways for the example Probe Set above are also annotated in the following
way:
<DatabaseEntry accession="MAP00252">
<Database_assnref>
<Database_ref identifier="DB:kegg"/>
</Database_assnref>
</DatabaseEntry>
<DatabaseEntry accession="MAP00970">
<Database_assnref>
<Database_ref identifier="DB:kegg"/>
</Database_assnref>
</DatabaseEntry>
The <PolymerType_assn> element is required. It has the following invariant form
for all Probe Sets in all of our expression arrays:
<PolymerType_assn>
<OntologyEntry category="polymertype" value="RNA"/>
</PolymerType_assn>
The <Type_assn> element is required. It has the following form
for all Probe Sets in all of our expression arrays:
<Type_assn>
<OntologyEntry category="biosequence:type" value="mRNA"/>
</Type_assn>
The value attribute may be "consensus mRNA",
"exemplar mRNA", or simply "mRNA".
The <Species_assn> element is optional.
Since most of our chips are designed with probes for a single
species, we usually omit this element to reduce redundancy.
When present, it has the following form:
<Species_assn>
<OntologyEntry category="species" value="Homo sapiens" />
</Species_assn>
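Code that needs a species for every probe set has to cope with the element being absent. One possible approach, shown in the Python sketch below, is to fall back to the Affy:Chip_Species value from the file-level <PropertySets_assnlist>. That fallback is an assumption of this example rather than a rule of the format, and the helper name species_of is hypothetical.

```python
import xml.etree.ElementTree as ET

WITH_SPECIES = """
<BioSequence identifier="Affy:Transcript:HG-U133A:1053_at">
  <Species_assn>
    <OntologyEntry category="species" value="Homo sapiens" />
  </Species_assn>
</BioSequence>
"""

WITHOUT_SPECIES = """
<BioSequence identifier="Affy:Transcript:HG-U133A:200048_s_at"/>
"""

def species_of(biosequence, chip_species):
    """Return the probe set's species, or chip_species if none is given."""
    entry = biosequence.find("Species_assn/OntologyEntry")
    if entry is not None and entry.get("category") == "species":
        return entry.get("value")
    # Assumed fallback: the Affy:Chip_Species value for the whole array.
    return chip_species

chip_species = "Homo sapiens"  # e.g. read from Affy:Chip_Species
print(species_of(ET.fromstring(WITH_SPECIES), chip_species))     # Homo sapiens
print(species_of(ET.fromstring(WITHOUT_SPECIES), chip_species))  # Homo sapiens
```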