The annotation data available in the NetAffx Analysis Center
for GeneChip® arrays are also available in a comma-separated-values
tabular format. Annotation files for each array will be updated
when the data on the NetAffx Analysis Center are updated for that
array, generally on a quarterly basis, using the method described
here. The annotation date is included
as a field inside each file. (Sequence data are constant and do not need
to be updated when annotations are updated.)
Annotation data, sequence data, BLAST results,
and lists of between-chip orthologs are available
in separate files. To find support files,
select the chip you are interested in from the list on the
support by product page.
The files are provided in "ZIP" compressed format.
Although we have made every effort to make these files as complete
and accurate as possible, we do not guarantee that they are free
from errors or omissions. In particular, errors in data received
from outside sources are beyond our control. Please refer to our
Terms and Conditions
for details on the acceptable use of these data.
The format of the files may change
somewhat whenever we update our annotation data or choose to include
additional forms of annotations. Refer to the release notes below.
|
 |
 |
 |
 |
 |
 |
 |
 |
 |
Minor changes occurred with the October 2004 release due to changes in annotation methods.
Removed columns: Overlapping Transcripts
Altered columns: Alignments, Annotation Transcript Cluster, Annotation Description, Trans Membrane
New columns: Transcript Assignments, Annotation Notes
 |
 |
 |
 |
 |
 |
Significant changes occurred with the March 2004 release; please refer to the column descriptions below.
|
 |
 |
 |
 |
 |
 |
Beginning with the October 2003 release,
BLAST and Ortholog/Homolog
annotations are being provided in separate files. Proteome
BioKnowledge® Library data will no longer be provided (due
to licensing issues).
The following columns have been removed:
Protein Similarities BLASTP (GenBank NR)
Protein Similarities BLASTX (SwissProt/TrEMBL)
Orthologs/Homologs
All 5 Proteome columns
 |
 |
 |
 |
 |
 |
The June 2003 release includes the following additions:
Annotations for control probe sets are now included.
Cluster type information is now included along with the UniGene accession number.
An extra data column has been added at the end of each row for
Quantitative Trait Loci (QTL). The QTL column contains useful data for rat probe sets.
The format of the TMM data has been updated to contain more information.
 |
 |
 |
 |
 |
 |
The files are available in a comma-separated-values (CSV) format.
These are plain-text files with each row terminated by a new-line
character. Data in separate fields are enclosed in quotation marks
and separated by commas. None of the data fields contains any of
these characters: quotation mark, new-line, carriage return, or
tab.
We expect these files to be used primarily in spreadsheet applications
and database programs (such as SQL databases). We have tried to
format the data so that both of these uses are relatively
easy. Note that some of the files, and some of the data fields in them,
are large.
There is one file per GeneChip array. Thus there are separate
files for HG-U133A and HG-U133B, rather than one file for the Human Genome
U133 Set.
The first row of each file contains the titles of the fields contained
in the subsequent rows.
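As a minimal sketch of reading a file in this layout, Python's standard csv module handles quoted, comma-separated rows and can use the first row as field titles. The three column titles and the two sample rows below are illustrative assumptions, not an excerpt from a real annotation file.

```python
import csv
import io

# Hypothetical two-row excerpt; real files contain many more columns.
SAMPLE = (
    '"Probe Set ID","Chip","Gene Symbol"\n'
    '"200007_at","HG-U133A","---"\n'
    '"200011_s_at","HG-U133A","---"\n'
)

def read_annotations(text_stream):
    """Return one dict per probe set, keyed by the titles in the first row."""
    return list(csv.DictReader(text_stream))

rows = read_annotations(io.StringIO(SAMPLE))
for row in rows:
    print(row["Probe Set ID"], row["Chip"])
```

For a file on disk, the same function can be given a file object opened with newline="" as the csv module documentation recommends.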
Each row after the first row contains annotations for a single
probe set. All annotations for that probe set are contained in that
single row. In some fields, such as the protein domain annotations,
there can be more than one annotation for a single probe set. In
this case, the multiple values are separated by the string " /// ".
In many types of annotations, sub-fields are separated by " // ".
For example, an annotation for a "GO Biological Process" might
appear as "7155 // cell adhesion // predicted/computed". In this
case, the sections correspond to "ID // Description // Evidence",
but the meaning of the sub-fields varies between different types
of annotation, as described below.
Empty fields are indicated by "---". The point of using such a
string rather than leaving the field empty is that it makes the
columnar nature of the data more clearly visible in certain spreadsheet
programs.
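The separator conventions above can be sketched as a small helper that turns one field into a list of annotations, each a list of sub-fields, and treats "---" as empty. The function and constant names are hypothetical and chosen only for this illustration.

```python
EMPTY = "---"            # marker used for empty fields
VALUE_SEP = " /// "      # separates multiple annotations in one field
SUBFIELD_SEP = " // "    # separates sub-fields within one annotation

def split_field(field):
    """Split one field into a list of annotations, each a list of sub-fields."""
    if field.strip() == EMPTY:
        return []
    return [value.split(SUBFIELD_SEP) for value in field.split(VALUE_SEP)]

print(split_field("7155 // cell adhesion // predicted/computed"))
print(split_field("P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"))
print(split_field("---"))
```

Splitting on the longer separator first keeps the two levels apart, since the value separator is applied before any sub-field is examined.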
Some columns in some files contain no data. To help users merge
data from multiple files, such empty columns are not removed. Thus
each file has the same columns in the same order.
Some fields, such as "Chip," contain the same value for every
probe set in a file. Although these data are redundant within any
individual file, they are useful to users who merge data from multiple
files.
|
 |
 |
 |
 |
 |
 |
In the following sections, we describe the content of each field
of the data files. The fields are of four types:
When possible, we provide instructions on how to create deep links
to data from various public databases. Take care to read the terms
and conditions of a site before using deep links to it. Also, please
be aware that the location of and availability of these databases
may change from time to time.
|
 |
 |
 |
 |
 |
 |
 |
 |
 |
The probe set identifier.
Examples: 200007_at, 200011_s_at, 200012_x_at.
You may create deep links to probe set information in the NetAffx
Analysis Center, subject to our Terms
and Conditions. Usually it will be sufficient to simply supply
the Probe Set ID(s), but sometimes you also may want to indicate
the array. Deep links are fully explained in the Direct
Access to Probe Set Information manual.
Deep Link Example (you will be asked to log in):
https://www.affymetrix.com/LinkServlet?probeset=10156_at
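A minimal sketch of building such a link for one probe set follows. It uses only the "probeset" parameter shown in the example above; any further parameters (for instance, one naming the array) are described in the manual and are not guessed here. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the deep link example above.
BASE_URL = "https://www.affymetrix.com/LinkServlet"

def probe_set_link(probe_set_id):
    """Build a deep link for a single probe set ID."""
    return BASE_URL + "?" + urlencode({"probeset": probe_set_id})

print(probe_set_link("10156_at"))
```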
|
 |
 |
 |
 |
 |
 |
The GeneChip probe array name.
|
 |
 |
 |
 |
 |
 |
The genus and species
of the organism represented by the probe set.
 |
 |
 |
 |
 |
 |
The date that the annotations for this probe array were last updated.
It will generally be earlier than the date when the annotations
were posted on the web site.
|
 |
 |
 |
 |
 |
 |
 |
 |
 |
Indicates whether the
sequence is an Exemplar, Consensus, or Control sequence. An Exemplar
is a single nucleotide sequence taken directly from a public database.
This sequence could be an mRNA or EST. A Consensus sequence
is a nucleotide sequence assembled by Affymetrix, based on one or
more sequences taken from a public database.
 |
 |
 |
 |
 |
 |
The database from which the sequence used to design this probe
set was taken.
Examples (including deep links, when available):
Refer to the terms and conditions of each site before using
such deep links.
|
 |
 |
 |
 |
 |
 |
The accession number of a representative sequence.
Note that for consensus-based probe sets, the representative
sequence is only one of several sequences (sequence sub-clusters)
used to build the consensus sequence and it is not directly used
to derive the probe sequences.
The representative sequence is chosen during array design as a
sequence that is best associated with the transcribed region
being interrogated by the probe set.
Refer to the "Sequence Source" field to determine the
database used.
(The name of this field may be written as "Sequence Derived From".)
|
 |
 |
 |
 |
 |
 |
GenBank description associated with the representative public identifier.
Blank for some probe sets.
 |
 |
 |
 |
 |
 |
Cluster identification
number with a sub-cluster identifier appended. Currently provided
only for HG-U133, HG-Focus, and newer arrays.
 |
 |
 |
 |
 |
 |
Identifier associated with the group of sequences that are detected
by a probe set. The identifier is a number assigned by Affymetrix
and does not refer to any external database. For HG-U133, HG-Focus,
and newer arrays, the number of UniGene
clusters detected by the probe set is also provided.
Example: "1500074332 (2 transcripts, 2 gene clusters)"
Beginning with the arrays released in 2003, we are providing more
information in this field. Instead of an Affymetrix ID number, we
list all the UniGene transcripts (group members) that are detected
by the probe set.
Example: "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 (5 transcripts, 3 gene clusters)"
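The layout of this field can be sketched as a member list followed by a summary in parentheses. The parser below is an illustration only: the function names are hypothetical, and the rule that a member such as "Mm.21841.2" belongs to cluster "Mm.21841" is an assumption inferred from the example, not a documented guarantee.

```python
import re

# Matches the trailing summary, e.g. "(5 transcripts, 3 gene clusters)".
_SUMMARY = re.compile(r"\((\d+) transcripts?, (\d+) gene clusters?\)\s*$")

def parse_cluster_field(field):
    """Split the field into its member list and the two reported counts."""
    match = _SUMMARY.search(field)
    if match is None:
        raise ValueError("no '(N transcripts, M gene clusters)' summary found")
    members = field[:match.start()].strip().split(":")
    return {
        "members": members,
        "transcripts": int(match.group(1)),
        "gene_clusters": int(match.group(2)),
    }

def cluster_ids(members):
    """Assumption: dropping the last dot-separated part gives the cluster ID."""
    return list(dict.fromkeys(m.rsplit(".", 1)[0] for m in members))

parsed = parse_cluster_field(
    "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 "
    "(5 transcripts, 3 gene clusters)"
)
print(parsed["members"])
print(cluster_ids(parsed["members"]))
```

The same function also accepts the older form with a single Affymetrix ID number, in which case the member list has one entry.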
|
 |
 |
 |
 |
 |
 |
UniGene cluster ID, curated at the time of array design, of the group of sequences used to build the consensus sequence. This information is static and is never updated. However, since UniGene clusters are frequently retired or split into other clusters, we update the UniGene cluster ID for each probe set every quarter, and this latest UniGene ID is provided as a different field (called "UniGene ID") in the download files and NetAffx annotation reports.
 |
 |
 |
 |
 |
 |
Most of the data in this
section come from LocusLink and UniGene,
and are annotations of the reference sequence on which the probe set
is modeled.
 |
 |
 |
Title of the reference
sequence. Blank in many cases.
 |
 |
 |
 |
 |
 |
Title of the gene represented by the probe set.
 |
 |
 |
 |
 |
 |
A gene symbol, when one is available. Such symbols are assigned
by different organizations for different species. Our data come
from the UniGene
record. We do not attempt to indicate which species-specific databank
was used, but some of the possibilities include:
|
 |
 |
 |
 |
 |
 |
Chromosomal location.
We add the prefix "Chr:" to prevent some spreadsheet applications
from interpreting some locations as dates.
 |
 |
 |
 |
 |
 |
UniGene
accession number.
Example Deep Link:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=55682
|
 |
 |
 |
 |
 |
 |
LocusLink
accession number.
 |
 |
 |
 |
 |
 |
SWISS-PROT
(sometimes known as SWALL) accession numbers. SWISS-PROT entries
also have an ID, in addition to an accession number. (For example,
the accession number "P11474" has the ID "ERR1_HUMAN".) You can
use the same deep-link URL to link to entries based on either an
accession number or an ID.
Example: "P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"
Example Deep Links:
http://us.expasy.org/cgi-bin/get-sprot-entry?P11474
http://us.expasy.org/cgi-bin/get-sprot-entry?ERR1_HUMAN
|
 |
 |
 |
 |
 |
 |
Enzyme
Commission family number.
Example Deep Links:
http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/25.html
http://www.expasy.ch/cgi-bin/nicezyme.pl?3.4.24.25
Note: Proper EC numbers have four levels. Sometimes we cannot
provide an annotation beyond the third level, resulting in a number
like "EC:3.4.24" (or possibly "EC:3.4.24.-"). You can look up data
on this sort of number with a URL like this: http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/
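The URL pattern in the examples above can be sketched as follows. This covers only the three-level and four-level cases shown, assumes the site layout matches those example URLs (which may change), and uses a hypothetical function name.

```python
# Base address taken from the example deep links above.
BASE_URL = "http://www.chem.qmul.ac.uk/iubmb/enzyme/"

def ec_link(ec_number):
    """Build a lookup URL for an EC number with three or four levels."""
    if ec_number.startswith("EC:"):
        ec_number = ec_number[3:]
    # A trailing "-" marks a level that could not be assigned.
    levels = [part for part in ec_number.split(".") if part != "-"]
    if len(levels) == 4:
        return BASE_URL + "EC" + "/".join(levels) + ".html"
    if len(levels) == 3:
        return BASE_URL + "EC" + "/".join(levels) + "/"
    raise ValueError("expected three or four EC levels: " + ec_number)

print(ec_link("EC:3.4.24.25"))
print(ec_link("EC:3.4.24.-"))
```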
|
 |
 |
 |
 |
 |
 |
OMIM:
Online Mendelian Inheritance in Man accession number.
Example Deep Link:
http://www3.ncbi.nlm.nih.gov/htbin-post/Omim/dispmim?603167
|
 |
 |
 |
 |
 |
 |
An ID from the Ensembl project.
Example Deep Link:
http://www.ensembl.org/Mus_musculus/geneview?gene=ENSMUSG00000032603
 |
 |
 |
 |
 |
 |
References to a variety of other databases appear in separate columns.
|
 |
 |
 |
 |
 |
 |
References to multiple sequences in RefSeq. The field
contains the ID and description for each entry, and there can be multiple
entries per probe set.
Example: "NM_002662 // phospholipase D1, phophatidylcholine-specific"
Example Deep Link:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_002662
|
 |
 |
 |
 |
 |
 |
ID of the protein sequence in the NCBI RefSeq database.
 |
 |
 |
 |
 |
 |
 |
 |
 |
Data referring to the Gene Ontology Consortium.
See also AmiGO and QuickGO.
The following types of annotation appear in separate columns:
Biological Process (GO)
Cellular Component (GO)
Molecular Function (GO)
Each annotation consists of three parts: "Accession Number //
Description // Evidence". The description corresponds directly to
the GO ID. The evidence can be "direct" or "extended".
"Direct evidence" shows the type of evidence provided by the GO
Consortium for their curation of a relationship between the LocusLink
accession and the GO Ontology term.
Example: "7155 // cell adhesion // predicted/computed"
"Extended evidence" indicates that the Ontology term was curated
by Affymetrix Inc. based on similarity with genes that are annotated
by the GO Consortium. Data in this field relates the GO term to
a PFam or EC number related to the reference sequence. In the first
example below the enzyme with EC number "3.4.21.79" aligns to the
reference sequence with an E-Value = "5.38e-88", and the relationship
between that enzyme and the GO ID "4278" is "inferred from electronic
annotation". In the second example the PFam accession ID "EMP24_GP25L"
aligns to the reference sequence with an E-Value of "2e-57", and
the relationship between that PFam accession ID and GO ID "8320"
is "Unknown".
Example: "4278 // granzyme B // extended:inferred from electronic annotation; 3.4.21.79; 5.38e-88 /// 8320 // protein carrier // extended:Unknown; EMP24_GP25L; 2e-57"
Example Deep Link:
http://godatabase.org/cgi-bin/go.cgi?query=7155&view=details&depth=0
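The two evidence forms described above can be told apart by the "extended:" prefix. The sketch below is a hedged illustration: the function names and dictionary keys are hypothetical, and it assumes that extended evidence always has exactly three parts separated by semicolons, as in the examples.

```python
def parse_go_annotation(annotation):
    """Parse one 'Accession Number // Description // Evidence' annotation."""
    go_id, description, evidence = annotation.split(" // ")
    result = {"id": go_id, "description": description,
              "kind": "direct", "evidence": evidence}
    prefix = "extended:"
    if evidence.startswith(prefix):
        # Assumed layout: "relationship; PFam or EC accession; E-value".
        relation, source, e_value = [
            part.strip() for part in evidence[len(prefix):].split(";")
        ]
        result.update(kind="extended", evidence=relation,
                      source=source, e_value=float(e_value))
    return result

def parse_go_field(field):
    """Parse a whole field; multiple annotations are separated by ' /// '."""
    if field.strip() == "---":
        return []
    return [parse_go_annotation(item) for item in field.split(" /// ")]

field = ("4278 // granzyme B // extended:inferred from electronic annotation; "
         "3.4.21.79; 5.38e-88 /// 8320 // protein carrier // "
         "extended:Unknown; EMP24_GP25L; 2e-57")
for entry in parse_go_field(field):
    print(entry)
```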
|
 |
 |
 |
 |
 |
 |
References to Gene
MicroArray Pathway Profiler and KEGG: Kyoto Encyclopedia
of Genes and Genomes.
|
 |
 |
 |
 |
 |
 |
InterPro
accession number and description.
Example: "IPR003594 // ATP-binding protein, ATPase-like"
Example Deep Link:
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR003594
|
 |
 |
 |
 |
 |
 |
Pfam and SCOP Annotations
Example Deep Links:
http://pfam.wustl.edu/cgi-bin/getdesc?name=p450
http://scop.berkeley.edu/search.cgi?key=d1d8db_
|
 |
 |
 |
 |
 |
 |
Transmembrane helix predictions are generated using the TMHMM program, as filtered by InterPro. See http://www.cbs.dtu.dk/~krogh/TMHMM/
The TMM annotations for each probe set have the following format:
Format: "ID // span:(domain boundaries) // numtm:(number of domains)"
Example: "NP_054700.1 // span:417-439 // numtm:1"
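A TMM annotation in this format can be taken apart with a short parser. The sketch below is based on the single example above; how several domain boundaries are written within one "span:" part is not stated, so collecting every "start-end" pair found there is an assumption. The function name is hypothetical.

```python
import re

def parse_tmm(annotation):
    """Parse 'ID // span:(boundaries) // numtm:(count)' into its parts."""
    # Tolerate missing spaces around the "//" separators.
    protein_id, span_part, numtm_part = re.split(r"\s*//\s*", annotation.strip())
    # Assumption: each domain appears as a "start-end" pair of residue numbers.
    spans = [(int(start), int(end))
             for start, end in re.findall(r"(\d+)-(\d+)", span_part)]
    count = int(numtm_part.split(":", 1)[1])
    return {"id": protein_id, "spans": spans, "numtm": count}

print(parse_tmm("NP_054700.1 // span:417-439 // numtm:1"))
```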
|
 |
 |
 |
 |
 |
 |
References to entries in various databases, indicated by namespace
prefixes. In each case, our annotations include an accession number
and a description. Each description may end with an E-value.
"GPCR:" indicates entries from GPCRDB:
Information system for G protein-coupled receptors (GPCRs).
"EC:" indicates an EC Number.
Those entries include a symbol, then the EC Number itself, then
a description. The first word in the description is a Sw |