The annotation data available in the NetAffx Analysis Center for GeneChip® arrays are also available in a comma-separated-values tabular format. Annotation files for each array will be updated when the data on the NetAffx Analysis Center are updated for that array, generally on a quarterly basis, using the method described here. The annotation date is included as a field inside each file. (Sequence data is constant, and does not need to be updated when annotations are updated.)
Annotation data, Sequence data, BLAST results, and lists of between-chip orthologs are available in separate files. To find support files, select the chip you are interested in from the list on the support by product page.
The files are provided in "ZIP" compressed format.
Although we have made every effort to make these files as complete and accurate as possible, we do not guarantee that they are free from errors or omissions. In particular, errors in data received from outside sources are beyond our control. Please refer to our Terms and Conditions for details on the acceptable use of these data.
The format of the files may change somewhat whenever we update our annotation data or choose to include additional forms of annotations. Refer to the release notes below.
October 2004 Update
Minor changes occurred with the October 2004 release due to changes in annotation methods.
- Removed columns: Overlapping Transcripts
- Altered columns: Alignments, Annotation Transcript Cluster, Annotation Description, Trans Membrane
- New columns: Transcript Assignments, Annotation Notes
March 2004 Update
Significant changes occurred with the March 2004 release. Please refer to the column descriptions below.
October 2003 Update
Beginning with the October 2003 release, BLAST and Ortholog/Homolog annotations are being provided in separate files. Proteome BioKnowledge® Library data will no longer be provided (due to licensing issues).
The following columns have been removed:
- Protein Similarities BLASTP (GenBank NR)
- Protein Similarities BLASTX (SwissProt/TrEMBL)
- All 5 Proteome columns
June 2003 Update
The June 2003 release includes the following additions:
- Annotations for control probe sets are now included.
- Cluster type information is now included along with the UniGene accession number.
- An extra data column has been added at the end of each row for Quantitative Trait Loci (QTL). The QTL column contains useful data for rat probe sets.
- The format of the TMM data has been updated to contain more information.
Understanding the Tabular Formatted Probe Set Files
The files are available in a comma-separated-values (CSV) format. These are plain-text files with each row terminated by a new-line character. Data in separate fields are enclosed in quotation marks and separated by commas. None of the data fields contains any of these characters: quotation mark, new-line, carriage return, or tab.
We expect these files to be used primarily in spreadsheet applications and database programs (such as SQL databases). We have tried to format the data in a way that makes both of these uses relatively easy. Note that some of the files, and some of the data fields in them, are large.
There is one file per GeneChip array. Thus there are separate files for HG-U133A and HG-U133B, rather than one file for the Human Genome U133 Set.
The first row of each file contains the titles of the fields contained in the subsequent rows.
Each row after the first row contains annotations for a single probe set. All annotations for that probe set are contained in that single row. In some fields, such as the protein domain annotations, there can be more than one annotation for a single probe set. In this case, the multiple values are separated by the string " /// ".
In many types of annotations, sub-fields are separated by " // ". For example, an annotation for a "GO Biological Process" might appear as "7155 // cell adhesion // predicted/computed". In this case, the sections correspond to "ID // Description // Evidence", but the meaning of the sub-fields varies between different types of annotation, as described below.
Empty fields are indicated by "---". The point of using such a string rather than leaving the field empty is that it makes the columnar nature of the data more clearly visible in certain spreadsheet programs.
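The separator conventions above can be handled with a small helper. The following is a minimal sketch in Python, not an official parser; the function name parse_field is an assumption made for illustration. It treats "---" as an empty field, splits multiple values on " /// ", and splits each value into sub-fields on " // ".

```python
EMPTY = "---"          # placeholder used for empty fields
VALUE_SEP = " /// "    # separates multiple annotations in one field
SUBFIELD_SEP = " // "  # separates sub-fields within one annotation

def parse_field(raw):
    """Return a list of annotations, each a list of sub-field strings."""
    raw = raw.strip()
    if raw in ("", EMPTY):
        return []
    # Split on the longer separator first, then on the shorter one.
    return [value.split(SUBFIELD_SEP) for value in raw.split(VALUE_SEP)]

print(parse_field("7155 // cell adhesion // predicted/computed"))
print(parse_field("P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"))
print(parse_field("---"))
```

Sub-fields are returned as plain strings; a "---" that appears inside an individual sub-field is left unchanged so that the caller can decide how to treat it.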
Some columns in some files contain no data. To help users merge data from multiple files, such empty columns are not removed. Thus each file has the same columns in the same order.
Some fields, such as "Chip," contain the same value for every probe set in a file. Although these data are redundant within any individual file, they are useful to users who merge data from multiple files.
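Because every file has the same columns in the same order, rows from several files can simply be concatenated. The sketch below uses Python's standard csv module and assumes the column titles "Probe Set ID" and "Chip"; the sample rows (including the probe set ID "hypothetical_at") are made up for illustration and stand in for real downloaded files.

```python
import csv
import io

EMPTY = "---"

def read_annotation_rows(handle):
    """Yield one dict per probe set, keyed by the titles in the first row."""
    for row in csv.DictReader(handle):
        # Map the "---" placeholder to None so empty fields are easy to test.
        yield {title: (None if value == EMPTY else value)
               for title, value in row.items()}

def merge_annotation_files(handles):
    """Concatenate the rows of several files that share the same columns."""
    merged = []
    for handle in handles:
        merged.extend(read_annotation_rows(handle))
    return merged

# Made-up sample content standing in for two downloaded files.
FILE_A = '"Probe Set ID","Chip"\n"200007_at","HG-U133A"\n'
FILE_B = '"Probe Set ID","Chip"\n"hypothetical_at","HG-U133B"\n'

rows = merge_annotation_files([io.StringIO(FILE_A), io.StringIO(FILE_B)])
print([(row["Chip"], row["Probe Set ID"]) for row in rows])
```

With files on disk, each handle would be opened with open(path, newline="") as the csv module recommends. The redundant "Chip" value is what keeps the merged rows distinguishable.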
In the following sections, we describe the content of each field of the data files. The fields are of four types:
- GeneChip Array Information
- Probe Design Information
- Public Domain and Genomic References
- Functional Annotations
When possible, we provide instructions on how to create deep links to data from various public databases. Take care to read the terms and conditions of a site before using deep links to it. Also, please be aware that the location of and availability of these databases may change from time to time.
GeneChip Array Information
Probe Set ID
The probe set identifier.
Examples: 200007_at, 200011_s_at, 200012_x_at.
You may create deep links to probe set information in the NetAffx Analysis Center, subject to our Terms and Conditions. Usually it will be sufficient to supply the Probe Set ID(s), but sometimes you may also want to indicate the array. Deep links are fully explained in the Direct Access to Probe Set Information manual.
Deep Link Example (you will be asked to log in):
The GeneChip probe array name.
Species Scientific Name
The genus and species of the organism represented by the probe set.
The date that the annotations for this probe array were last updated. It will generally be earlier than the date when the annotations were posted on the web site.
Probe Design Information
Indicates whether the sequence is an Exemplar, Consensus or Control sequence. An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence is a nucleotide sequence assembled by Affymetrix, based on one or more sequences taken from a public database.
The database from which the sequence used to design this probe set was taken.
Examples (including deep links, when available): Refer to the terms and conditions of each site before using such deep links.
Example GenBank Deep Link URL:
Example RefSeq Deep Link URL:
Example Deep Link URL:
Annotations from The Institute for Genomic Research. These occur mainly on our mouse (Mus musculus) arrays.
Annotations beginning with "TC" refer to the TIGR Mouse Gene Index. Annotations beginning with "HT" (Human) or "ET" (other species) are sequence IDs from The Expressed Gene Anatomy Database (EGAD).
Annotations beginning with "HG" are gene IDs; use the sequence IDs for deep links.
Example Deep Link URL for TC numbers: http://www.tigr.org/tigr-scripts/tgi/tc_report.pl?species=1&tc=TC641394
Example Deep Links URLs for Sequence IDs:
Affymetrix Proprietary Database
Stanford Public Database (SGD: Saccharomyces Genome Database)
Blattner: University of Wisconsin at Madison, E. coli Genome Database. F. R. Blattner's lab.
The accession numbers are the same as in GenBank.
Pseudomonas: Annotations from the Pseudomonas Genome Project.
Example Deep Link URL:
NCBI E. coli Genome
The accession numbers are the same as in GenBank
Representative Public ID
The accession number of a representative sequence. Note that for consensus-based probe sets, the representative sequence is only one of several sequences (sequence sub-clusters) used to build the consensus sequence and it is not directly used to derive the probe sequences. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set. Refer to the "Sequence Source" field to determine the database used.
(The name of this field may be written as "Sequence Derived From".)
GenBank description associated with the representative public identifier. Blank for some probe sets.
Cluster identification number with a sub-cluster identifier appended. Currently provided only for HG-U133, HG-Focus, and newer arrays.
Identifier associated with the group of sequences that are detected by a probe set. The identifier is a number assigned by Affymetrix and does not refer to any external database. For HG-U133, HG-Focus, and newer arrays, the number of UniGene clusters detected by the probe set is also provided.
Example: "1500074332 (2 transcripts, 2 gene clusters)"
Beginning with the arrays released in 2003, we are providing more information in this field. Instead of an Affymetrix ID number, we list all the UniGene transcripts (group members) that are detected by the probe set.
Examples: "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 (5 transcripts, 3 gene clusters)"
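Both forms of this field end with the same parenthesized counts, so one routine can read either. This is a sketch only: the function names are hypothetical, and the helper cluster_ids assumes that the part of each member after the last period is the sub-cluster identifier.

```python
import re

# Matches the trailing "(N transcripts, M gene clusters)" summary.
_COUNTS = re.compile(r"\((\d+) transcripts?, (\d+) gene clusters?\)\s*$")

def parse_group_members(raw):
    """Split the field into its member list and the two reported counts."""
    match = _COUNTS.search(raw)
    if match is None:
        return None
    head = raw[:match.start()].strip()
    return {
        "members": head.split(":"),  # a single Affymetrix ID in older files
        "transcripts": int(match.group(1)),
        "gene_clusters": int(match.group(2)),
    }

def cluster_ids(members):
    """Drop the assumed sub-cluster suffix and return the distinct clusters."""
    return sorted({member.rsplit(".", 1)[0] for member in members})

group = parse_group_members(
    "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 "
    "(5 transcripts, 3 gene clusters)"
)
print(group["members"])
print(cluster_ids(group["members"]))
```

For the example above, the five members reduce to three distinct cluster IDs, which agrees with the counts reported in parentheses.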
Archival UniGene Cluster
UniGene cluster ID, curated at the time of array design, of the group of sequences used to build the consensus sequence. This is static information and never gets updated. However, since UniGene clusters are frequently retired or split into other clusters, we update the UniGene cluster ID for each probe set every quarter and this latest UniGene ID is provided as a different field (called "UniGene ID") in the download files and NetAffx annotation reports.
Title of the reference sequence. Blank in many cases.
Title of Gene represented by the probe set.
A gene symbol, when one is available. Such symbols are assigned by different organizations for different species. Our data come from the UniGene record. We do not attempt to indicate which species-specific databank was used, but some of the possibilities include:
- HUGO: The Human Genome Organization
- RGD: The Rat Genome Database
- MGD: Mouse Genome Database Project at MGI
- SubtiList (Bacillus subtilis)
Chromosomal location. We add the prefix "Chr:" to avoid problems with some spreadsheet applications interpreting some locations as dates.
UniGene accession number.
Example Deep Link:
LocusLink accession number.
SWISS-PROT (sometimes known as SWALL) accession numbers. SWISS-PROT entries also have an ID, in addition to an accession number. (For example, the accession number "P11474" has the ID "ERR1_HUMAN".) You can use the same deep-link URL to link to entries based on either an accession number or an ID.
Example: "P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"
Enzyme Commission family number.
Note: Proper EC numbers have four levels. Sometimes we cannot provide an annotation beyond the third level, resulting in a number like "EC:3.4.24" (or possibly "EC:3.4.24.-"). You can look up data on this sort of number with a URL like this: http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/
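The URL pattern in the note can be generated from a three-level annotation as sketched below. The function name is hypothetical, and only the three-level case described above is handled; the URL layout for complete four-level numbers is not covered here, so the function returns None for them.

```python
BASE_URL = "http://www.chem.qmul.ac.uk/iubmb/enzyme/"

def ec_family_url(annotation):
    """Build the lookup URL for a three-level EC number such as "EC:3.4.24"."""
    number = annotation.strip()
    if number.upper().startswith("EC:"):
        number = number[3:]
    # Ignore a trailing "-" placeholder, as in "EC:3.4.24.-".
    levels = [part for part in number.split(".") if part and part != "-"]
    if len(levels) != 3 or not all(part.isdigit() for part in levels):
        return None
    return BASE_URL + "EC" + "/".join(levels) + "/"

print(ec_family_url("EC:3.4.24"))
print(ec_family_url("EC:3.4.24.-"))
```

Both calls produce the URL given in the note. As with any deep link, the location of the target site may change.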
OMIM: Online Mendelian Inheritance in Man accession number.
Example Deep Link:
An ID from the Ensembl project.
Example Deep Link:
References to a variety of other databases appear in separate columns.
- AGI: TAIR: The Arabidopsis Information Resource
- FLYBASE: FlyBase: A Database of the Drosophila Genome
- MGI Name: Mouse Genome Informatics Database
- RGD Name: Rat Genome Database
- SGD Accession: SGD: Saccharomyces Genome Database
- WORMBASE: WormBase: The Genome and Biology of C. elegans
RefSeq Transcript ID
References to multiple sequences in RefSeq. The field contains the ID and Description for each entry, and there can be multiple entries per probe set.
Example: "NM_002662 // phospholipase D1, phophatidylcholine-specific"
Example Deep Link:
RefSeq Protein ID
ID of the protein sequence in the NCBI RefSeq database.
Gene Ontology (GO) Data
The following types of annotation appear in separate columns:
- Biological Process (GO)
- Cellular Component (GO)
- Molecular Function (GO)
Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct" or "extended".
"Direct evidence" shows the type of evidence provided by the GO Consortium for their curation of a relationship between the LocusLink accession and the GO Ontology term.
Example: "7155 // cell adhesion // predicted/computed"
"Extended evidence" indicates that the Ontology term was curated by Affymetrix Inc. based on similarity with genes that are annotated by the GO Consortium. Data in this field relates the GO term to a PFam or EC number related to the reference sequence. In the first example below the enzyme with EC number "184.108.40.206" aligns to the reference sequence with an E-Value = "5.38e-88", and the relationship between that enzyme and the GO ID "4278" is "inferred from electronic annotation". In the second example the PFam accession ID "EMP24_GP25L" aligns to the reference sequence with an E-Value of "2e-57", and the relationship between that PFam accession ID and GO ID "8320" is "Unknown".
Example: "4278 // granzyme B // extended:inferred from electronic annotation; 220.127.116.11; 5.38e-88 /// 8320 // protein carrier // extended:Unknown; EMP24_GP25L; 2e-57"
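A GO field can be unpacked into its three parts, with extended evidence further split into the evidence text, the related PFam or EC identifier, and the E-value. The sketch below is illustrative; the function name and the dictionary keys are assumptions, and the layout of the extended evidence is inferred from the examples above.

```python
def parse_go_field(raw):
    """Parse "ID // Description // Evidence" entries separated by " /// "."""
    if raw.strip() == "---":
        return []
    records = []
    for entry in raw.split(" /// "):
        go_id, description, evidence = entry.split(" // ", 2)
        record = {"id": go_id, "description": description, "kind": "direct",
                  "evidence": evidence, "source": None, "e_value": None}
        if evidence.startswith("extended:"):
            # Extended evidence looks like "extended:<evidence>; <ID>; <E-value>".
            parts = [p.strip() for p in evidence[len("extended:"):].split(";")]
            record["kind"] = "extended"
            record["evidence"] = parts[0]
            if len(parts) == 3:
                record["source"] = parts[1]
                record["e_value"] = float(parts[2])
        records.append(record)
    return records

for rec in parse_go_field(
    "7155 // cell adhesion // predicted/computed /// "
    "8320 // protein carrier // extended:Unknown; EMP24_GP25L; 2e-57"
):
    print(rec)
```

The first record is reported as direct evidence and the second as extended evidence with source "EMP24_GP25L".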
Example Deep Link:
InterPro accession number and description.
Example: "IPR003594 // ATP-binding protein, ATPase-like"
Example Deep Link:
Pfam and SCOP Annotations
Transmembrane helix predictions are generated using the TMHMM program, as filtered by InterPro. See http://www.cbs.dtu.dk/~krogh/TMHMM/
The TMM annotations for each probe set are of the format:
Example: "NP_054700.1 // span:417-439 // numtm:1"
Format: "ID // span:(domain boundaries) // numtm:(number of domains)"
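A TMM annotation in this format can be read with a regular expression, as in the following sketch. The function name is hypothetical. Collecting every "start-end" pair found in the span sub-field is an assumption that allows for more than one domain; the example above shows only a single range.

```python
import re

_TMM = re.compile(
    r"^(?P<id>\S+)\s*//\s*span:(?P<span>[^/]*?)\s*//\s*numtm:(?P<numtm>\d+)$"
)

def parse_tmm(entry):
    """Return the protein ID, domain boundaries, and number of domains."""
    match = _TMM.match(entry.strip())
    if match is None:
        return None
    # Collect each "start-end" pair found in the span sub-field.
    spans = [(int(start), int(end))
             for start, end in re.findall(r"(\d+)-(\d+)", match.group("span"))]
    return {"id": match.group("id"), "spans": spans,
            "numtm": int(match.group("numtm"))}

print(parse_tmm("NP_054700.1 // span:417-439 // numtm:1"))
```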
References to entries in the various databases, indicated by namespace prefixes. In each case, our annotations include an accession number and a description. Each description may end with an E-value.
"GPCR:" indicates entries from GPCRDB: Information system for G protein-coupled receptors (GPCRs).
"P450:" indicates an entry from The Cytochrome P450 Database.
"HANKS:" indicates an entry from The Protein Kinase Resource of Steven Hanks.
"EC:18.104.22.168 // HE_PARLI HATCHING ENZYME PRECURSOR (EC 22.214.171.124)(HE) (HEZ) (ENVELYSIN) (SEA-URCHIN-HATCHING PROTEINASE).;3.8e-171"
"GPCR:GRFR_HUMAN // family_2B_GHRH"
"P450:CYP2A-4 // CYP2A-4 Cyt P450:Animalia.CYP2A-4.mou;0"
Position of the alignment of the target sequence on the genome.
Format: "chromosome:start-end (strand) // identity // cytoband"
Example: "chr6:30964144-30975910 (+) // 95.63 // p21.33"
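The alignment format lends itself to a regular expression. The sketch below is one way to extract typed values from a single alignment; the function name and dictionary keys are chosen for illustration.

```python
import re

_ALIGNMENT = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s*"
    r"\((?P<strand>[+-])\)\s*//\s*(?P<identity>[\d.]+)\s*//\s*(?P<cytoband>.+)$"
)

def parse_alignment(entry):
    """Parse "chromosome:start-end (strand) // identity // cytoband"."""
    match = _ALIGNMENT.match(entry.strip())
    if match is None:
        return None
    return {
        "chromosome": match.group("chrom"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": match.group("strand"),
        "identity": float(match.group("identity")),
        "cytoband": match.group("cytoband"),
    }

print(parse_alignment("chr6:30964144-30975910 (+) // 95.63 // p21.33"))
```

Entries that do not match, including the "---" placeholder, yield None instead of raising an error.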
Version of the Genome used to generate data under "Alignments" column.
ADDED IN JUNE 2003 RELEASE
Quantitative Trait Loci (QTL) are genetic linkage data that provide disease associations for various loci. These annotations only appear on Rat arrays. (The column is present, but empty for other arrays.) See QTLs Help at the Rat Genome Database, which is the source of our data.
Format: "QTL ID // Full Name // Trait // Association"
Example: "61421 // CIA Severity QTL 12 // Arthritis Severity // candidate"
Example deep link:
Annotation Transcript Cluster
Contains the IDs of the transcripts used in the transcript to gene annotation mapping.
Format: "mRNA(number of matching probes),mRNA2(number of matching probes)"
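As a sketch of how this format might be read, the function below maps each transcript ID to its number of matching probes. The function name is hypothetical, and the sample string is constructed for illustration from IDs that appear elsewhere in this document; it is not taken from a real file.

```python
import re

# One "ID(count)" item; items are separated by commas.
_ITEM = re.compile(r"([^,()\s]+)\((\d+)\)")

def parse_transcript_cluster(raw):
    """Map each transcript ID to its number of matching probes."""
    if raw.strip() == "---":
        return {}
    return {transcript: int(count) for transcript, count in _ITEM.findall(raw)}

# Constructed sample value, not taken from a real annotation file.
print(parse_transcript_cluster("NM_013994(16),ENST00000259875(16)"))
```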
Description of the method used to annotate the probe set.
Example: This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 1 transcripts. // false // Matching Probes // A
Format: human readable column comment // ambiguous assignment(true/false) // method // annotation class (see white paper)
Contains the mRNA to probe set assignments.
ENST00000259875 // cdna:known chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- /// NM_013994 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq // 16 // chr6:30964148-30975910(+)
Format: mRNA ID // description // source // number of matching probes // genome alignment of mRNA overlap (if any)
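The five sub-fields of each assignment can be given names, as in this sketch. The record type and function name are assumptions made for illustration. Entries that do not have exactly five sub-fields are skipped, and a "---" alignment is converted to None.

```python
from collections import namedtuple

Assignment = namedtuple(
    "Assignment", "mrna_id description source matching_probes alignment"
)

def parse_transcript_assignments(raw):
    """Parse the mRNA to probe set assignments of one probe set."""
    if raw.strip() == "---":
        return []
    assignments = []
    for entry in raw.split(" /// "):
        parts = entry.split(" // ")
        if len(parts) != 5:
            continue  # skip entries that do not follow the documented format
        mrna_id, description, source, probes, alignment = parts
        assignments.append(Assignment(
            mrna_id, description, source, int(probes),
            None if alignment == "---" else alignment,
        ))
    return assignments

EXAMPLE = (
    "ENST00000259875 // cdna:known chromosome:NCBI34:6:30958112:30974184:1 "
    "// ensembl // 16 // --- /// NM_013994 // Homo sapiens discoidin domain "
    "receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq "
    "// 16 // chr6:30964148-30975910(+)"
)
for assignment in parse_transcript_assignments(EXAMPLE):
    print(assignment.mrna_id, assignment.source, assignment.matching_probes)
```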
Contains assignments as discussed in the white paper above for cross hyb and negative strand alignments.
Negative strand assignments are common when a probe set is designed and the orientation of the EST data is conflicting. Probe sets are then created from both directions.
Example: ENST00000325423 // ensembl // 1 // Negative Strand Matching Probes /// GENSCAN00000014431 // ensembl // 8 // Cross Hyb Matching Probes
Format: mRNA id // source // number of probes // Negative Strand Matching Probes or Cross Hyb Matching Probes
A Note On E-Values
Many of our annotations contain an "E-value" as a part of their description field. An E-value reveals how well the RefSeq peptide sequence corresponding to our probe set matches to the sequence in question. It represents the number of matches expected to occur by chance in the given database. Smaller values indicate better matches. (An E-value is dependent on both the query and the database size and can take any non-negative value.)
When present, the E-value occurs at the end of the description field, preceded by a semi-colon. Example: ";8.9e-18".
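Since the E-value always follows the last semi-colon of the description, it can be separated with a short routine such as the following sketch. The function name is hypothetical, and the pattern assumes the E-value is written as a plain or exponent-form number, as in the examples in this document.

```python
import re

# A number at the very end of the description, preceded by a semi-colon.
_E_VALUE = re.compile(r";\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$")

def split_e_value(description):
    """Return (description without the E-value, E-value or None)."""
    match = _E_VALUE.search(description)
    if match is None:
        return description, None
    return description[:match.start()], float(match.group(1))

print(split_e_value("CYP2A-4 Cyt P450:Animalia.CYP2A-4.mou;0"))
print(split_e_value("family_2B_GHRH"))
```

Descriptions without a trailing E-value are returned unchanged, paired with None.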
ADDED IN OCTOBER 2003 RELEASE
Beginning with the October 2003 annotation update, Ortholog/Homolog annotations are provided in separate files.
Ortholog files contain cross-references between probe sets on two Affymetrix GeneChip arrays where the reference sequences on which the two probes are based have a significant amount of similarity. The data on reference-sequence similarity are taken from HomoloGene and then cross-referenced by us to Affymetrix probe sets.
Ortholog files contain these columns:
- Probe Set ID
- GeneChip Array
- Ortholog Probe Set
- Ortholog Array
- Ortholog Target Title
Any given probe set can be associated with multiple ortholog probe sets. These will appear as separate lines in the file.
Ortholog Type can be "Curated Ortholog" or "Putative Ortholog".
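Because a probe set with several orthologs occupies several lines, it is often convenient to group the lines by Probe Set ID. The sketch below does this with the column titles listed above. The function name is hypothetical, and the sample rows (the ortholog probe set IDs and array names) are made up for illustration.

```python
import csv
import io
from collections import defaultdict

def group_orthologs(handle):
    """Map each Probe Set ID to a list of (ortholog array, ortholog probe set)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["Probe Set ID"]].append(
            (row["Ortholog Array"], row["Ortholog Probe Set"])
        )
    return dict(groups)

# Made-up sample content standing in for a downloaded ortholog file.
SAMPLE = (
    '"Probe Set ID","GeneChip Array","Ortholog Probe Set",'
    '"Ortholog Array","Ortholog Target Title"\n'
    '"200007_at","HG-U133A","hypothetical_1_at","ARRAY-X","---"\n'
    '"200007_at","HG-U133A","hypothetical_2_at","ARRAY-Y","---"\n'
)

print(group_orthologs(io.StringIO(SAMPLE)))
```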
BLASTP And BLASTX Files
ADDED IN OCTOBER 2003 RELEASE
Beginning with the October 2003 annotation update, BLASTP and BLASTX annotations are provided in separate files.
BLASTP files contain pre-computed BLASTP results against the GenBank NR database (all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF).
BLASTX files contain pre-computed BLASTX results against SwissProt/TrEMBL database peptides.
BLASTP files contain these columns:
- Probe Set ID
- BLASTP Hit Name (GI number, GenBank NR)
- Hit Description
BLASTX files contain these columns:
- Probe Set ID
- BLASTX Hit Name (Swissprot/TrEMBL)
- Hit Description