Affymetrix

Probe Set Data in Tabular Format


The annotation data available in the NetAffx™ Analysis Center for GeneChip® arrays are also available in a comma-separated-values tabular format. Annotation files for each array will be updated when the data on the NetAffx Analysis Center are updated for that array, generally on a quarterly basis, using the method described here. The annotation date is included as a field inside each file. (Sequence data is constant, and does not need to be updated when annotations are updated.)

Annotation data, sequence data, BLAST results, and lists of between-chip orthologs are available in separate files. To find support files, select the chip you are interested in from the list on the support by product page.

The files are provided in "ZIP" compressed format.

Although we have made every effort to make these files as complete and accurate as possible, we do not guarantee that they are free from errors or omissions. In particular, errors in data received from outside sources are beyond our control. Please refer to our Terms and Conditions for details on the acceptable use of these data.

The format of the files may change somewhat whenever we update our annotation data or choose to include additional forms of annotations. Refer to the release notes below.

Release Notes

October 2004 Update

Minor changes occurred with the October 2004 release due to changes in annotation methods.

  • Removed columns: Overlapping Transcripts
  • Altered columns: Alignments, Annotation Transcript Cluster, Annotation Description, Trans Membrane
  • New columns: Transcript Assignments, Annotation Notes

March 2004 Update

Significant changes occurred with the March 2004 release. Please refer to the column descriptions below.

October 2003 Update

Beginning with the October 2003 release, BLAST and Ortholog/Homolog annotations are being provided in separate files. Proteome BioKnowledge® Library data will no longer be provided (due to licensing issues).

The following columns have been removed:

  • Protein Similarities BLASTP (GenBank NR)
  • Protein Similarities BLASTX (SwissProt/TrEMBL)
  • Orthologs/Homologs
  • All 5 Proteome columns

June 2003 Update

The June 2003 release includes the following additions:

  • Annotations for control probe sets are now included.
  • Cluster type information is now included along with the UniGene accession number.
  • An extra data column has been added at the end of each row for Quantitative Trait Loci (QTL). The QTL column contains useful data for rat probe sets.
  • The format of the TMM data has been updated to contain more information.

Understanding the Tabular Formatted Probe Set Files

The files are available in a comma-separated-values (CSV) format. These are plain-text files with each row terminated by a new-line character. Data in separate fields are enclosed in quotation marks and separated by commas. None of the data fields contains any of these characters: quotation mark, new-line, carriage return, or tab.

We expect these files to be used primarily in spreadsheet applications and database programs (such as SQL databases). We have tried to format the data in a way that makes both of these uses relatively easy. Note that some of the files, and some of the data fields in them, are large.

There is one file per GeneChip array. Thus there are separate files for HG-U133A and HG-U133B, rather than one file for the Human Genome U133 Set.

The first row of each file contains the titles of the fields contained in the subsequent rows.

Each row after the first row contains annotations for a single probe set. All annotations for that probe set are contained in that single row. In some fields, such as the protein domain annotations, there can be more than one annotation for a single probe set. In this case, the multiple values are separated by the string " /// ".

In many types of annotations, sub-fields are separated by " // ". For example, an annotation for a "GO Biological Process" might appear as "7155 // cell adhesion // predicted/computed". In this case, the sections correspond to "ID // Description // Evidence", but the meaning of the sub-fields varies between different types of annotation, as described below.

Empty fields are indicated by "---". The point of using such a string rather than leaving the field empty is that it makes the columnar nature of the data more clearly visible in certain spreadsheet programs.

Some columns in some files contain no data. To help users merge data from multiple files, such empty columns are not removed. Thus each file has the same columns in the same order.

Some fields, such as "Chip," contain the same value for every probe set in a file. Although these data are redundant within any individual file, they are useful to users who merge data from multiple files.
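
As an illustration of the conventions above, the following Python sketch reads a file with the standard csv module and splits one field into annotations and sub-fields. The sample rows are invented for the example, and the column name "GO Biological Process" is an assumption; check the header row of the actual file you are working with.

```python
import csv
import io

EMPTY = "---"          # marker the files use for an empty field
MULTI_SEP = " /// "    # separates multiple annotations in one field
SUB_SEP = " // "       # separates sub-fields within one annotation


def split_field(value):
    """Return a list of annotations, each a list of sub-fields."""
    if value.strip() == EMPTY:
        return []
    return [entry.split(SUB_SEP) for entry in value.split(MULTI_SEP)]


# A tiny in-memory stand-in for an annotation file (invented rows).
sample = io.StringIO(
    '"Probe Set ID","GO Biological Process"\n'
    '"200007_at","7155 // cell adhesion // predicted/computed"\n'
    '"200011_s_at","---"\n'
)

for row in csv.DictReader(sample):
    print(row["Probe Set ID"], split_field(row["GO Biological Process"]))
```

Splitting on " /// " first and " // " second keeps values such as "predicted/computed", which contain a single slash, intact.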

Data Fields

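In the following sections, we describe the content of each field of the data files. The fields are of four types:

  • GeneChip Array Information
  • Probe Design Information
  • Public Domain and Genomic References
  • Functional Annotations
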
When possible, we provide instructions on how to create deep links to data from various public databases. Take care to read the terms and conditions of a site before using deep links to it. Also, please be aware that the location of and availability of these databases may change from time to time.

GeneChip Array Information

Probe Set ID

The probe set identifier.

Examples: 200007_at, 200011_s_at, 200012_x_at.

You may create deep links to probe set information in the NetAffx Analysis Center, subject to our Terms and Conditions. Usually it will be sufficient to simply supply the Probe Set ID(s), but sometimes you also may want to indicate the array. Deep links are fully explained in the Direct Access to Probe Set Information manual.

Deep Link Example (you will be asked to log in):
https://www.affymetrix.com/LinkServlet?probeset=10156_at

GeneChip Array

The GeneChip probe array name.

Species Scientific Name

The genus and species of the organism represented by the probe set.

Annotation Date

The date that the annotations for this probe array were last updated. It will generally be earlier than the date when the annotations were posted on the web site.

Probe Design Information

Sequence Type

Indicates whether the sequence is an Exemplar, Consensus, or Control sequence. An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence is a nucleotide sequence assembled by Affymetrix, based on one or more sequences taken from a public database.

Sequence Source

The database from which the sequence used to design this probe set was taken.

Examples (including deep links, when available): Refer to the terms and conditions of each site before using such deep links.


Annotations beginning with "TC" refer to the TIGR Mouse Gene Index. Annotations beginning with "HT" (Human) or "ET" (other species) are sequence IDs from The Expressed Gene Anatomy Database (EGAD).
Annotations beginning with "HG" are gene IDs; use the sequence IDs for deep links.

Example Deep Link URL for TC numbers: http://www.tigr.org/tigr-scripts/tgi/tc_report.pl?species=1&tc=TC641394
Example Deep Link URLs for Sequence IDs:
http://www.tigr.org/tigr-scripts/tgi/egad_report.pl?htnum=ET63226
http://www.tigr.org/tigr-scripts/tgi/egad_report.pl?htnum=HT3236

Affymetrix Proprietary Database

Stanford Public Database (SGD™: Saccharomyces Genome Database)

TubercuList

Blattner: University of Wisconsin at Madison, E. coli Genome Database. F. R. Blattner's lab.

The accession numbers are the same as in GenBank.

Pseudomonas: Annotations from the Pseudomonas Genome Project.

Example Deep Link URL:
http://www.pseudomonas.com/AnnotationByPAU.asp?PA=PA0007

FlyBase

NCBI E. coli Genome
The accession numbers are the same as in GenBank.

Representative Public ID

The accession number of a representative sequence. Note that for consensus-based probe sets, the representative sequence is only one of several sequences (sequence sub-clusters) used to build the consensus sequence and it is not directly used to derive the probe sequences. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set. Refer to the "Sequence Source" field to determine the database used.

(The name of this field may be written as "Sequence Derived From".)

Target Description

GenBank description associated with the representative public identifier. Blank for some probe sets.

Transcript ID

Cluster identification number with a sub-cluster identifier appended. Currently provided only for HG-U133, HG-Focus, and newer arrays.

Group ID

Identifier associated with the group of sequences that are detected by a probe set. The identifier is a number assigned by Affymetrix and does not refer to any external database. For HG-U133, HG-Focus, and newer arrays, the number of UniGene clusters detected by the probe set is also provided.

Example: "1500074332 (2 transcripts, 2 gene clusters)"

Beginning with the arrays released in 2003, we are providing more information in this field. Instead of an Affymetrix ID number, we list all the UniGene transcripts (group members) that are detected by the probe set.

Example: "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 (5 transcripts, 3 gene clusters)"
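
Both forms of this field can be split into their parts with a small Python sketch such as the one below. It assumes the value always ends with a parenthesised "(N transcripts, M gene clusters)" suffix, as in the examples shown; values without that suffix are returned as None. The function name is hypothetical.

```python
import re

# Members (one ID, or several joined by ":"), then the counts in parentheses.
GROUP_RE = re.compile(
    r"^(?P<members>.+?)\s*\((?P<transcripts>\d+) transcripts?, "
    r"(?P<clusters>\d+) gene clusters?\)$"
)


def parse_group_id(value):
    """Split a Group ID value into member IDs and the two counts."""
    match = GROUP_RE.match(value.strip())
    if match is None:
        return None  # layout differs from the documented examples
    return {
        "members": match.group("members").split(":"),
        "transcripts": int(match.group("transcripts")),
        "gene_clusters": int(match.group("clusters")),
    }


print(parse_group_id("1500074332 (2 transcripts, 2 gene clusters)"))
```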

Archival UniGene Cluster

UniGene cluster ID, curated at the time of array design, of the group of sequences used to build the consensus sequence. This is static information and never gets updated. However, since UniGene clusters are frequently retired or split into other clusters, we update the UniGene cluster ID for each probe set every quarter and this latest UniGene ID is provided as a different field (called "UniGene ID") in the download files and NetAffx annotation reports.

Public Domain and Genomic References

Most of the data in this section come from LocusLink and UniGene, and are annotations of the reference sequence on which the probe set is modeled.

Title of the reference sequence. Blank in many cases.

Gene Title

Title of Gene represented by the probe set.

Gene Symbol

A gene symbol, when one is available. Such symbols are assigned by different organizations for different species. Our data come from the UniGene record. We do not attempt to indicate which species-specific databank was used, but some of the possibilities include:


Chromosome Location

Chromosomal location. We add the prefix "Chr:" to prevent some spreadsheet applications from interpreting certain locations as dates.

UniGene ID

UniGene accession number.

Example Deep Link:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=55682

LocusLink

LocusLink accession number.

SwissProt

SWISS-PROT (sometimes known as SWALL) accession numbers. SWISS-PROT entries also have an ID, in addition to an accession number. (For example, the accession number "P11474" has the ID "ERR1_HUMAN".) You can use the same deep-link URL to link to entries based on either an accession number or an ID.

Example: "P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"

Example Deep Links:
http://us.expasy.org/cgi-bin/get-sprot-entry?P11474
http://us.expasy.org/cgi-bin/get-sprot-entry?ERR1_HUMAN
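
A minimal Python sketch for turning a SwissProt field into deep links, using only the URL pattern shown in the examples above. The function name is hypothetical, and the sketch assumes the field follows the " /// " and "---" conventions described earlier.

```python
from urllib.parse import quote

# URL prefix taken from the example deep links above.
SPROT_BASE = "http://us.expasy.org/cgi-bin/get-sprot-entry?"


def swissprot_links(field):
    """Return one deep link per accession number (or ID) in the field."""
    if field.strip() == "---":
        return []
    return [SPROT_BASE + quote(item.strip()) for item in field.split(" /// ")]


for url in swissprot_links("P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"):
    print(url)
```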

EC

Enzyme Commission family number.

Example Deep Links:
http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/25.html
http://www.expasy.ch/cgi-bin/nicezyme.pl?3.4.24.25

Note: Proper EC numbers have four levels. Sometimes we cannot provide an annotation beyond the third level, resulting in a number like "EC:3.4.24" (or possibly "EC:3.4.24.-"). You can look up data on this sort of number with a URL like this: http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/
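
The two URL patterns above (four-level and three-level) can be generated with a short Python sketch like this one. It covers only the layouts shown in the examples; the function name is hypothetical, and EC numbers with fewer than three levels are rejected because no URL pattern for them is given here.

```python
IUBMB_BASE = "http://www.chem.qmul.ac.uk/iubmb/enzyme/EC"


def ec_to_iubmb_url(ec):
    """Build an IUBMB URL from values like '3.4.24.25', 'EC:3.4.24' or 'EC:3.4.24.-'."""
    number = ec[3:] if ec.startswith("EC:") else ec
    # Drop the "-" placeholder used for a missing fourth level.
    levels = [part for part in number.split(".") if part and part != "-"]
    if len(levels) not in (3, 4):
        raise ValueError("expected a three- or four-level EC number: " + ec)
    url = IUBMB_BASE + "/".join(levels[:3])
    if len(levels) == 4:
        return url + "/" + levels[3] + ".html"
    return url + "/"


print(ec_to_iubmb_url("3.4.24.25"))
print(ec_to_iubmb_url("EC:3.4.24.-"))
```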

OMIM

OMIM™: Online Mendelian Inheritance in Man accession number.

Example Deep Link:
http://www3.ncbi.nlm.nih.gov/htbin-post/Omim/dispmim?603167

Ensembl

An ID from the Ensembl project.

Example Deep Link:
http://www.ensembl.org/Mus_musculus/geneview?gene=ENSMUSG00000032603

References

References to a variety of other databases appear in separate columns.

Each column refers to the database named after it:

  • "AGI": TAIR: The Arabidopsis Information Resource
  • "FLYBASE": FlyBase: A Database of the Drosophila Genome
  • "MGI Name": Mouse Genome Informatics Database
  • "RGD Name": Rat Genome Database
  • "SGD Accession": SGD™: Saccharomyces Genome Database
  • "WORMBASE": WormBase: The Genome and Biology of C. elegans

RefSeq Transcript ID

References to sequences in RefSeq. The field contains the ID and description for each entry, and there can be multiple entries per probe set.

Example: "NM_002662 // phospholipase D1, phophatidylcholine-specific"

Example Deep Link:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_002662

RefSeq Protein ID

ID of the protein sequence in the NCBI RefSeq database.

Functional Annotations

Gene Ontology (GO) Data

Data referring to the Gene Ontology™ Consortium.
See also AmiGO and QuickGO.

The following types of annotation appear in separate columns:

  • Biological Process (GO)
  • Cellular Component (GO)
  • Molecular Function (GO)

Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct" or "extended".

"Direct evidence" shows the type of evidence provided by the GO Consortium for their curation of a relationship between the LocusLink accession and the GO Ontology term.

Example: "7155 // cell adhesion // predicted/computed"

"Extended evidence" indicates that the Ontology term was curated by Affymetrix Inc. based on similarity with genes that are annotated by the GO Consortium. Data in this field relates the GO term to a PFam or EC number related to the reference sequence. In the first example below the enzyme with EC number "3.4.21.79" aligns to the reference sequence with an E-Value = "5.38e-88", and the relationship between that enzyme and the GO ID "4278" is "inferred from electronic annotation". In the second example the PFam accession ID "EMP24_GP25L" aligns to the reference sequence with an E-Value of "2e-57", and the relationship between that PFam accession ID and GO ID "8320" is "Unknown".

Example: "4278 // granzyme B // extended:inferred from electronic annotation; 3.4.21.79; 5.38e-88 /// 8320 // protein carrier // extended:Unknown; EMP24_GP25L; 2e-57"
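
The direct and extended forms can be told apart programmatically. The Python sketch below assumes that extended evidence always has the layout "extended:relationship; PFam or EC identifier; E-value", as in the example above; entries that deviate from it keep only the relationship text. The function and key names are hypothetical.

```python
def parse_go_annotation(entry):
    """Parse one 'Accession // Description // Evidence' annotation."""
    accession, description, evidence = entry.split(" // ", 2)
    record = {"accession": accession, "description": description}
    if evidence.startswith("extended:"):
        parts = [p.strip() for p in evidence[len("extended:"):].split(";")]
        record["kind"] = "extended"
        record["evidence"] = parts[0]
        # Assumed layout: relationship; PFam or EC identifier; E-value.
        if len(parts) == 3:
            record["related_id"] = parts[1]
            record["e_value"] = float(parts[2])
    else:
        record["kind"] = "direct"
        record["evidence"] = evidence
    return record


field = (
    "4278 // granzyme B // extended:inferred from electronic annotation; "
    "3.4.21.79; 5.38e-88 /// 8320 // protein carrier // "
    "extended:Unknown; EMP24_GP25L; 2e-57"
)
records = [parse_go_annotation(entry) for entry in field.split(" /// ")]
for record in records:
    print(record)
```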

Example Deep Link:
http://godatabase.org/cgi-bin/go.cgi?query=7155&view=details&depth=0

Pathways

References to Gene MicroArray Pathway Profiler and KEGG: Kyoto Encyclopedia of Genes and Genomes.

InterPro

InterPro accession number and description.

Example: "IPR003594 // ATP-binding protein, ATPase-like"

Example Deep Link:
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR003594

Protein Domains

Pfam and SCOP Annotations

Example Deep Links:
http://pfam.wustl.edu/cgi-bin/getdesc?name=p450
http://scop.berkeley.edu/search.cgi?key=d1d8db_

Trans Membrane

Transmembrane helix predictions are generated using the TMHMM program, as filtered by InterPro. See http://www.cbs.dtu.dk/~krogh/TMHMM/

The TMM annotations for each probe set have the following format:
Format: "ID // span:(domain boundaries) // numtm:(number of domains)"
Example: "NP_054700.1 // span:417-439 // numtm:1"
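
A minimal Python sketch for reading one TMM annotation. It splits on "//" and trims whitespace, then collects every "start-end" pair found in the span sub-field. How several domains are written within one span is not specified here, so the regular expression simply picks up all number pairs; treat that as an assumption. The function name is hypothetical.

```python
import re


def parse_tmm(entry):
    """Parse 'ID // span:start-end // numtm:N' into a dictionary."""
    protein_id, span, numtm = [part.strip() for part in entry.split("//")]
    # Collect every start-end pair in the span sub-field.
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", span)]
    return {
        "id": protein_id,
        "spans": spans,
        "numtm": int(numtm.split(":", 1)[1]),
    }


print(parse_tmm("NP_054700.1 // span:417-439 // numtm:1"))
```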

Protein Families

References to entries in the various databases, indicated by namespace prefixes. In each case, our annotations include an accession number and a description. Each description may end with an E-value.

"GPCR:" indicates entries from GPCRDB: Information system for G protein-coupled receptors (GPCRs).

"EC:" indicates an EC Number. Those entries include a symbol, then the EC Number itself, then a description. The first word in the description is a SwissProt ID, such as "HE_PARLI".

"P450:" indicates an entry from The Cytochrome P450 Database.

"HANKS:" indicates an entry from The Protein Kinase Resource of Steven Hanks.

Examples:
"EC:3.4.24.12 // HE_PARLI HATCHING ENZYME PRECURSOR (EC 3.4.24.12)(HE) (HEZ) (ENVELYSIN) (SEA-URCHIN-HATCHING PROTEINASE).;3.8e-171"
"GPCR:GRFR_HUMAN // family_2B_GHRH"
"P450:CYP2A-4 // CYP2A-4 Cyt P450:Animalia.CYP2A-4.mou;0"

Alignments

Position of the alignment of the target sequence on the genome.

Format: "chromosome:start-end (strand) // identity // cytoband"

Example: "chr6:30964144-30975910 (+) // 95.63 // p21.33"
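
For users loading alignments into a database, the format above can be broken into typed values with a Python sketch like the following. The regular expression is written against the single documented example, so other layouts return None; the function and key names are hypothetical.

```python
import re

ALIGN_RE = re.compile(
    r"^(?P<chromosome>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)\s*"
    r"\((?P<strand>[+-])\)\s*//\s*(?P<identity>[\d.]+)\s*//\s*(?P<cytoband>.+)$"
)


def parse_alignment(entry):
    """Parse 'chromosome:start-end (strand) // identity // cytoband'."""
    match = ALIGN_RE.match(entry.strip())
    if match is None:
        return None  # layout differs from the documented format
    return {
        "chromosome": match.group("chromosome"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": match.group("strand"),
        "identity": float(match.group("identity")),
        "cytoband": match.group("cytoband"),
    }


print(parse_alignment("chr6:30964144-30975910 (+) // 95.63 // p21.33"))
```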

Genome Version

Version of the genome used to generate the data in the "Alignments" column.

QTL

ADDED IN JUNE 2003 RELEASE

Quantitative Trait Loci (QTL) are genetic linkage data that provide disease associations for various loci. These annotations only appear on Rat arrays. (The column is present, but empty for other arrays.) See QTLs Help at the Rat Genome Database, which is the source of our data.

Format: "QTL ID // Full Name // Trait // Association"

Example: "61421 // CIA Severity QTL 12 // Arthritis Severity // candidate"

Example deep link:
http://rgd.mcw.edu/tools/qtls/qtls_view.cgi?id=61421

Annotation Transcript Cluster

Contains the IDs of the transcripts used in the transcript-to-gene annotation mapping.

Example: "M87338(15),NM_181471(12)"
Format: "mRNA(number of matching probes),mRNA2(number of matching probes)"
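
A short Python sketch that turns this field into a mapping from transcript ID to number of matching probes. It assumes every entry has the "ID(count)" shape shown in the example; the function name is hypothetical.

```python
import re


def parse_transcript_cluster(value):
    """Map each transcript ID to its number of matching probes."""
    pairs = re.findall(r"([^,()\s]+)\((\d+)\)", value)
    return {transcript: int(count) for transcript, count in pairs}


print(parse_transcript_cluster("M87338(15),NM_181471(12)"))
```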

Annotation Description

Description of the method used to annotate the probe set.

Example: This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 1 transcripts. // false // Matching Probes // A
Format: human readable column comment // ambiguous assignment(true/false) // method // annotation class (see white paper)

Transcript Assignments

Contains the mRNA to probe set assignments.

Example: ENST00000259875 // cdna:known chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- /// NM_013994 // Homo sapiens discoidin domain receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq // 16 // chr6:30964148-30975910(+)
Format: mRNA ID // description // source // number of matching probes // genome alignment of mRNA overlap (if any)
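
The following Python sketch splits a Transcript Assignments value into one record per mRNA. It assumes each entry has at least five " // "-separated parts, with the last three being source, probe count, and alignment, so that a description containing the separator would still be kept whole. The function and key names are hypothetical.

```python
def parse_transcript_assignments(value):
    """Return one dictionary per mRNA assignment in the field."""
    if value.strip() == "---":
        return []
    records = []
    for entry in value.split(" /// "):
        parts = entry.split(" // ")
        if len(parts) < 5:
            continue  # skip entries that do not match the documented format
        source, probes, alignment = parts[-3:]
        records.append({
            "mrna_id": parts[0],
            "description": " // ".join(parts[1:-3]),
            "source": source,
            "matching_probes": int(probes),
            # "---" marks an assignment without a genome alignment.
            "alignment": None if alignment == "---" else alignment,
        })
    return records


field = (
    "ENST00000259875 // cdna:known chromosome:NCBI34:6:30958112:30974184:1 "
    "// ensembl // 16 // --- /// NM_013994 // Homo sapiens discoidin domain "
    "receptor family, member 1 (DDR1), transcript variant 3, mRNA. // refseq "
    "// 16 // chr6:30964148-30975910(+)"
)
assignments = parse_transcript_assignments(field)
for record in assignments:
    print(record["mrna_id"], record["source"], record["alignment"])
```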

Annotation Notes

Contains assignments for cross-hybridizing (cross hyb) and negative-strand alignments, as discussed in the white paper.

Negative-strand assignments are common when the EST data available at the time a probe set is designed conflict in orientation. In that case, probe sets are created for both directions.
Example: ENST00000325423 // ensembl // 1 // Negative Strand Matching Probes /// GENSCAN00000014431 // ensembl // 8 // Cross Hyb Matching Probes
Format: mRNA id // source // number of probes // Negative Strand Matching Probes or Cross Hyb Matching Probes

A Note On E-Values

Many of our annotations contain an "E-value" as part of their description field. An E-value indicates how well the RefSeq peptide sequence corresponding to our probe set matches the sequence in question. It represents the number of matches expected to occur by chance in the given database. Smaller values indicate better matches. (An E-value depends on both the query and the database size and can take any non-negative value.)

When present, the E-value occurs at the end of the description field, preceded by a semi-colon. Example: ";8.9e-18".
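
A minimal Python sketch for separating a trailing E-value from a description. It treats the text after the last semi-colon as an E-value only if it parses as a number, so descriptions that merely contain a semi-colon are left untouched. The function name is hypothetical.

```python
def split_e_value(description):
    """Return (description without E-value, E-value or None)."""
    text, separator, tail = description.rpartition(";")
    if separator:
        try:
            return text, float(tail)
        except ValueError:
            pass  # the text after the last ";" is not a number
    return description, None


print(split_e_value("CYP2A-4 Cyt P450:Animalia.CYP2A-4.mou;0"))
print(split_e_value("family_2B_GHRH"))
```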

Ortholog Files

ADDED IN OCTOBER 2003 RELEASE

Beginning with the October 2003 annotation update, Ortholog/Homolog annotations are provided in separate files.

Ortholog files contain cross-references between probe sets on two Affymetrix GeneChip arrays where the reference sequences on which the two probe sets are based have a significant amount of similarity. The data on reference-sequence similarity are taken from HomoloGene and then cross-referenced by us to Affymetrix probe sets.

Ortholog files contain these columns:

  • Probe Set ID
  • GeneChip Array
  • Ortholog Probe Set
  • Ortholog Array
  • Ortholog Target Title

Any given probe set can be associated with multiple ortholog probe sets. These will appear as separate lines in the file.

Ortholog Type can be "Curated Ortholog" or "Putative Ortholog".
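
Because one probe set can occupy several lines, it is often convenient to group the rows by Probe Set ID. The Python sketch below does this with the column names listed above; the sample rows, array names, and probe set IDs are invented placeholders, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_orthologs(handle):
    """Map each probe set ID to a list of (ortholog array, ortholog probe set)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["Probe Set ID"]].append(
            (row["Ortholog Array"], row["Ortholog Probe Set"])
        )
    return dict(groups)


# Invented rows standing in for a real ortholog file.
sample = io.StringIO(
    '"Probe Set ID","GeneChip Array","Ortholog Probe Set",'
    '"Ortholog Array","Ortholog Target Title"\n'
    '"1000_at","ArrayA","2000_at","ArrayB","example title"\n'
    '"1000_at","ArrayA","3000_s_at","ArrayC","example title"\n'
)
orthologs = group_orthologs(sample)
print(orthologs["1000_at"])
```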

BLASTP And BLASTX Files

ADDED IN OCTOBER 2003 RELEASE

Beginning with the October 2003 annotation update, BLASTP and BLASTX annotations are provided in separate files.

BLASTP files contain pre-computed BLASTP results against the GenBank NR database (all non-redundant GenBank CDS translations + RefSeq Proteins + PDB + SwissProt + PIR + PRF).

BLASTX files contain pre-computed BLASTX results against peptides in the SwissProt/TrEMBL database.

BLASTP files contain these columns:

  • Probe Set ID
  • Chip
  • BLASTP Hit Name (GI number, GenBank NR)
  • Hit Description
  • E-value

BLASTX files contain these columns:

  • Probe Set ID
  • Chip
  • BLASTX Hit Name (Swissprot/TrEMBL)
  • Hit Description
  • E-value