  Probe Set Data in Tabular Format

The annotation data available in the NetAffx™ Analysis Center for GeneChip® arrays are also available in a comma-separated-values tabular format. Annotation files for each array are updated whenever the data on the NetAffx Analysis Center are updated for that array, generally on a quarterly basis, using the method described here. The annotation date is included as a field inside each file. (Sequence data are constant and do not need to be updated when annotations are updated.)

Annotation data, Sequence data, BLAST results, and lists of between-chip orthologs are available in separate files. To find support files, select the chip you are interested in from the list on the support by product page.

The files are provided in "ZIP" compressed format.

Although we have made every effort to make these files as complete and accurate as possible, we do not guarantee that they are free from errors or omissions. In particular, errors in data received from outside sources are beyond our control. Please refer to our Terms and Conditions for details on the acceptable use of these data.

The format of the files may change somewhat whenever we update our annotation data or choose to include additional forms of annotations. Refer to the release notes below.

RELEASE NOTES
October 2004 Update

Minor changes occurred with the October 2004 release due to changes in annotation methods.

Removed columns: Overlapping Transcripts
Altered columns: Alignments, Annotation Transcript Cluster, Annotation Description, Trans Membrane
New columns: Transcript Assignments, Annotation Notes
March 2004 Update

Significant changes occurred with the March 2004 release; please refer to the column descriptions below.

October 2003 Update

Beginning with the October 2003 release, BLAST and Ortholog/Homolog annotations are being provided in separate files. Proteome BioKnowledge® Library data will no longer be provided (due to licensing issues).

The following columns have been removed:
Protein Similarities BLASTP (GenBank NR)
Protein Similarities BLASTX (SwissProt/TrEMBL)
Orthologs/Homologs
All 5 Proteome columns

June 2003 Update
The June 2003 release includes the following additions:
An extra data column has been added at the end of each row for Quantitative Trait Loci (QTL). The QTL column contains useful data for rat probe sets.
The format of the TMM data has been updated to contain more information.
UNDERSTANDING THE TABULAR FORMATTED PROBE SET FILES

The files are available in a comma-separated-values (CSV) format. These are plain-text files with each row terminated by a new-line character. Data in separate fields are enclosed in quotation marks and separated by commas. None of the data fields contains any of these characters: quotation mark, new-line, carriage return, or tab.

We expect these files to be used primarily in spreadsheet applications and database programs (such as SQL databases). We have tried to format the data in a way that makes both of these uses relatively easy. Note that some of the files, and some of the data fields in them, are large.

There is one file per GeneChip array. Thus there are separate files for HG-U133A and HG-U133B, rather than one file for the Human Genome U133 Set.

The first row of each file contains the titles of the fields contained in the subsequent rows.

Each row after the first row contains annotations for a single probe set. All annotations for that probe set are contained in that single row. In some fields, such as the protein domain annotations, there can be more than one annotation for a single probe set. In this case, the multiple values are separated by the string " /// ".

In many types of annotations, sub-fields are separated by " // ". For example, an annotation for a "GO Biological Process" might appear as "7155 // cell adhesion // predicted/computed". In this case, the sections correspond to "ID // Description // Evidence", but the meaning of the sub-fields varies between different types of annotation, as described below.

Empty fields are indicated by "---". The point of using such a string rather than leaving the field empty is that it makes the columnar nature of the data more clearly visible in certain spreadsheet programs.
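The separator conventions described above can be sketched in Python as follows. This is a minimal illustration, not an Affymetrix-supplied parser: the function name `split_annotations` is hypothetical, and the sample strings are the examples quoted in this document.

```python
EMPTY = "---"          # placeholder used for an empty field
ENTRY_SEP = " /// "    # separates multiple annotations within one field
SUBFIELD_SEP = " // "  # separates sub-fields within one annotation


def split_annotations(field):
    """Split one field value into a list of annotations.

    Each annotation is returned as a list of its sub-fields.
    An empty field ("---") yields an empty list.
    """
    if field == EMPTY:
        return []
    # Split on the longer separator first, then on the shorter one.
    return [entry.split(SUBFIELD_SEP) for entry in field.split(ENTRY_SEP)]


go_example = split_annotations("7155 // cell adhesion // predicted/computed")
swissprot_example = split_annotations("P11474 /// Q8N4S8 /// Q96F89 /// Q96I02")
```

Because the meaning of the sub-fields differs between annotation types, the sketch returns plain lists and leaves their interpretation to the caller.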

Some columns in some files contain no data. To help users merge data from multiple files, such empty columns are not removed. Thus each file has the same columns in the same order.

Some fields, such as "Chip," contain the same value for every probe set in a file. Although these data are redundant within any individual file, they are useful to users who merge data from multiple files.
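Because every file has the same columns in the same order, rows from several arrays can be combined directly. The sketch below uses Python's standard `csv` module. The column titles and the two sample rows are invented for illustration only (real files have many more columns, and their exact titles should be checked against the first row), and the function names are hypothetical.

```python
import csv
import io


def read_annotation_rows(handle):
    """Yield one dict per probe set, keyed by the titles in the first row."""
    for row in csv.DictReader(handle):
        # Turn the "---" placeholder into None so empty fields are easy to test.
        yield {title: (None if value == "---" else value)
               for title, value in row.items()}


def merge_files(handles):
    """Combine the rows of several per-array files into a single list."""
    merged = []
    for handle in handles:
        merged.extend(read_annotation_rows(handle))
    return merged


# Invented stand-ins for two downloaded files (not real annotation data).
# Real files would be opened with open(path, newline="").
header = '"Probe Set ID","GeneChip Array","Gene Symbol"\n'
file_a = io.StringIO(header + '"200007_at","HG-U133A","---"\n')
file_b = io.StringIO(header + '"0000_at","HG-U133B","PLACEHOLDER"\n')

rows = merge_files([file_a, file_b])
```

After merging, the array-name column is what distinguishes rows that came from different files, which is why the redundant per-file values are useful.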

DATA FIELDS

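In the following sections, we describe the content of each field of the data files. The fields are of four types: GeneChip array information, probe design information, public domain and genomic references, and functional annotations.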

When possible, we provide instructions on how to create deep links to data from various public databases. Take care to read the terms and conditions of a site before using deep links to it. Also, please be aware that the location of and availability of these databases may change from time to time.
GENECHIP ARRAY INFORMATION
Probe Set ID

The probe set identifier.

Examples: 200007_at, 200011_s_at, 200012_x_at.

You may create deep links to probe set information in the NetAffx Analysis Center, subject to our Terms and Conditions. Usually it will be sufficient to simply supply the Probe Set ID(s), but sometimes you also may want to indicate the array. Deep links are fully explained in the Direct Access to Probe Set Information manual.

Deep Link Example (you will be asked to log in):
https://www.affymetrix.com/LinkServlet?probeset=10156_at
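A deep link of the form shown above could be generated from a Probe Set ID along these lines. This is a sketch that covers only the single-ID form from the example; how multiple IDs or the array name are passed is described in the Direct Access manual and is not assumed here. The function name `netaffx_link` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the deep link example in this document.
LINK_SERVLET = "https://www.affymetrix.com/LinkServlet"


def netaffx_link(probe_set_id):
    """Build a NetAffx deep link for a single probe set ID."""
    return LINK_SERVLET + "?" + urlencode({"probeset": probe_set_id})


link = netaffx_link("10156_at")
```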

GeneChip Array
The GeneChip probe array name.
Species Scientific Name
The genus and species of the organism represented by the probe set.
Annotation Date
The date that the annotations for this probe array were last updated. It will generally be earlier than the date when the annotations were posted on the web site.

PROBE DESIGN INFORMATION

Sequence Type
Indicates whether the sequence is an Exemplar, Consensus, or Control sequence. An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence is a nucleotide sequence assembled by Affymetrix, based on one or more sequences taken from a public database.
Sequence Source

The database from which the sequence used to design this probe set was taken.

Examples (including deep links, when available): Refer to the terms and conditions of each site before using such deep links.

Representative Public ID

The accession number of a representative sequence. Note that for consensus-based probe sets, the representative sequence is only one of several sequences (sequence sub-clusters) used to build the consensus sequence and it is not directly used to derive the probe sequences. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set. Refer to the "Sequence Source" field to determine the database used.

(The name of this field may be written as "Sequence Derived From".)

Target Description
GenBank description associated with the representative public identifier. Blank for some probe sets.
Transcript ID
Cluster identification number with a sub-cluster identifier appended. Currently provided only for HG-U133, HG-Focus, and newer arrays.
Group ID

Identifier associated with the group of sequences that are detected by a probe set. The identifier is a number assigned by Affymetrix and does not refer to any external database. For HG-U133, HG-Focus, and newer arrays, the number of UniGene clusters detected by the probe set is also provided.

Example: "1500074332 (2 transcripts, 2 gene clusters)"

Beginning with the arrays released in 2003, we are providing more information in this field. Instead of an Affymetrix ID number, we list all the UniGene transcripts (group members) that are detected by the probe set.

Examples: "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 (5 transcripts, 3 gene clusters)"
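Both forms of the Group ID value shown above can be split into their parts with a regular expression. The following is a sketch based only on the two examples in this section; it assumes that members are separated by ":" and that the parenthesized counts may be absent on older arrays. The singular forms "transcript" and "gene cluster" are accepted as a guess. The function name `parse_group_id` is hypothetical.

```python
import re

# "<members> (<n> transcripts, <m> gene clusters)", with the counts optional.
GROUP_RE = re.compile(
    r"^(?P<members>\S+)"
    r"(?: \((?P<transcripts>\d+) transcripts?, (?P<clusters>\d+) gene clusters?\))?$"
)


def parse_group_id(value):
    """Return the group members and, when present, the reported counts."""
    match = GROUP_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized Group ID value: {value!r}")
    transcripts = match.group("transcripts")
    clusters = match.group("clusters")
    return {
        # A plain Affymetrix ID number yields a single-element list.
        "members": match.group("members").split(":"),
        "transcripts": int(transcripts) if transcripts else None,
        "gene_clusters": int(clusters) if clusters else None,
    }


old_style = parse_group_id("1500074332 (2 transcripts, 2 gene clusters)")
new_style = parse_group_id(
    "Mm.21841.1:Mm.21841.2:Mm.190631.1:Mm.21841.4:Mm.218761.1 "
    "(5 transcripts, 3 gene clusters)"
)
```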

Archival UniGene Cluster
UniGene cluster ID, curated at the time of array design, of the group of sequences used to build the consensus sequence. This is static information and never gets updated. However, since UniGene clusters are frequently retired or split into other clusters, we update the UniGene cluster ID for each probe set every quarter and this latest UniGene ID is provided as a different field (called "UniGene ID") in the download files and NetAffx annotation reports.
PUBLIC DOMAIN AND GENOMIC REFERENCES
Most of the data in this section come from LocusLink and UniGene, and are annotations of the reference sequence on which the probe set is modeled.
Title of the reference sequence. Blank in many cases.
Gene Title
Title of Gene represented by the probe set.
Gene Symbol

A gene symbol, when one is available. Such symbols are assigned by different organizations for different species. Our data come from the UniGene record. We do not attempt to indicate which species-specific databank was used, but some of the possibilities include:

SubtiList (Bacillus subtilis)
Chromosome Location
Chromosomal location. We add the prefix "Chr:" to prevent some spreadsheet applications from interpreting certain locations as dates.
UniGene ID

UniGene accession number.

Example Deep Link:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=55682

LocusLink
LocusLink accession number.
SwissProt

SWISS-PROT (sometimes known as SWALL) accession numbers. SWISS-PROT entries also have an ID, in addition to an accession number. (For example, the accession number "P11474" has the ID "ERR1_HUMAN".) You can use the same deep-link URL to link to entries based on either an accession number or an ID.

Example: "P11474 /// Q8N4S8 /// Q96F89 /// Q96I02"

Example Deep Links:
http://us.expasy.org/cgi-bin/get-sprot-entry?P11474
http://us.expasy.org/cgi-bin/get-sprot-entry?ERR1_HUMAN

EC

Enzyme Commission family number.

Example Deep Links:
http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/25.html
http://www.expasy.ch/cgi-bin/nicezyme.pl?3.4.24.25

Note: Proper EC numbers have four levels. Sometimes we cannot provide an annotation beyond the third level, resulting in a number like "EC:3.4.24" (or possibly "EC:3.4.24.-"). You can look up data on this sort of number with a URL like this: http://www.chem.qmul.ac.uk/iubmb/enzyme/EC3/4/24/
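The mapping from an EC number to the two URL shapes shown above can be sketched as follows. The sketch handles only the three-level and four-level cases described in the note, strips an optional "EC:" prefix, and treats "-" as a missing level. The function name `ec_url` is hypothetical, and the site layout may have changed since this page was written.

```python
# Base URL taken from the deep link examples in this document.
QMUL_BASE = "http://www.chem.qmul.ac.uk/iubmb/enzyme/"


def ec_url(ec_number):
    """Map an EC number such as "3.4.24.25" or "EC:3.4.24.-" to a URL."""
    if ec_number.startswith("EC:"):
        ec_number = ec_number[len("EC:"):]
    # Drop "-" placeholders, which stand for levels that were not assigned.
    levels = [part for part in ec_number.split(".") if part != "-"]
    if len(levels) not in (3, 4) or not all(part.isdigit() for part in levels):
        raise ValueError(f"unsupported EC number: {ec_number!r}")
    path = "EC" + "/".join(levels)
    # Four levels point at an enzyme page; three levels point at a directory.
    suffix = ".html" if len(levels) == 4 else "/"
    return QMUL_BASE + path + suffix


full = ec_url("3.4.24.25")
partial = ec_url("EC:3.4.24.-")
```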

OMIM

OMIM™: Online Mendelian Inheritance in Man accession number.

Example Deep Link:
http://www3.ncbi.nlm.nih.gov/htbin-post/Omim/dispmim?603167

Ensembl

An ID from the Ensembl project.

Example Deep Link:

http://www.ensembl.org/Mus_musculus/geneview?gene=ENSMUSG00000032603
References

References to a variety of other databases appear in separate columns.
Column          Database
AGI             TAIR: The Arabidopsis Information Resource
FLYBASE         FlyBase: A Database of the Drosophila Genome
MGI Name        Mouse Genome Informatics Database
RGD Name        Rat Genome Database
SGD Accession   SGD™: Saccharomyces Genome Database
WORMBASE        WormBase: The Genome and Biology of C. elegans

RefSeq Transcript ID

References to sequences in RefSeq. The field contains the ID and description for each entry, and there can be multiple entries per probe set.

Example: "NM_002662 // phospholipase D1, phophatidylcholine-specific"

Example Deep Link:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_002662

RefSeq Protein ID
ID of the protein sequence in the NCBI RefSeq database.
FUNCTIONAL ANNOTATIONS
Gene Ontology (GO) Data

Data referring to the Gene Ontology™ Consortium.
See also AmiGO and QuickGO.

The following types of annotation appear in separate columns:

Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct" or "extended".

"Direct evidence" shows the type of evidence provided by the GO Consortium for their curation of a relationship between the LocusLink accession and the GO Ontology term.

Example: "7155 // cell adhesion // predicted/computed"

"Extended evidence" indicates that the Ontology term was curated by Affymetrix Inc. based on similarity with genes that are annotated by the GO Consortium. Data in this field relates the GO term to a PFam or EC number related to the reference sequence. In the first example below the enzyme with EC number "3.4.21.79" aligns to the reference sequence with an E-Value = "5.38e-88", and the relationship between that enzyme and the GO ID "4278" is "inferred from electronic annotation". In the second example the PFam accession ID "EMP24_GP25L" aligns to the reference sequence with an E-Value of "2e-57", and the relationship between that PFam accession ID and GO ID "8320" is "Unknown".

Example: "4278 // granzyme B // extended:inferred from electronic annotation; 3.4.21.79; 5.38e-88 /// 8320 // protein carrier // extended:Unknown; EMP24_GP25L; 2e-57"

Example Deep Link:
http://godatabase.org/cgi-bin/go.cgi?query=7155&view=details&depth=0
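The two evidence forms described in this section can be told apart by the "extended:" prefix. The following sketch parses a GO field into records; it assumes the extended form always has exactly three "; "-separated parts (evidence, PFam or EC identifier, E-value), as in the examples above. The function name `parse_go_field` and the dictionary keys are hypothetical.

```python
def parse_go_field(field):
    """Parse a GO column value into a list of annotation records."""
    records = []
    if field == "---":  # empty field
        return records
    for entry in field.split(" /// "):
        go_id, description, evidence = entry.split(" // ", 2)
        record = {
            "go_id": go_id,
            "description": description,
            "kind": "direct",
            "evidence": evidence,
            "source_id": None,  # PFam or EC identifier (extended only)
            "e_value": None,    # alignment E-value (extended only)
        }
        if evidence.startswith("extended:"):
            code, source_id, e_value = evidence[len("extended:"):].split("; ")
            record.update(kind="extended", evidence=code,
                          source_id=source_id, e_value=float(e_value))
        records.append(record)
    return records


direct = parse_go_field("7155 // cell adhesion // predicted/computed")
extended = parse_go_field(
    "4278 // granzyme B // extended:inferred from electronic annotation; "
    "3.4.21.79; 5.38e-88 /// 8320 // protein carrier // "
    "extended:Unknown; EMP24_GP25L; 2e-57"
)
```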

Pathways

References to Gene MicroArray Pathway Profiler and KEGG: Kyoto Encyclopedia of Genes and Genomes.

InterPro

InterPro accession number and description.

Example: "IPR003594 // ATP-binding protein, ATPase-like"

Example Deep Link:
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR003594

Protein Domains

Pfam and SCOP Annotations

Example Deep Links:
http://pfam.wustl.edu/cgi-bin/getdesc?name=p450
http://scop.berkeley.edu/search.cgi?key=d1d8db_

Trans Membrane

Transmembrane helix predictions are generated using the TMHMM program, as filtered by InterPro. See http://www.cbs.dtu.dk/~krogh/TMHMM/

The TMM annotations for each probe set have the following format:
Format: "ID // span:(domain boundaries) // numtm:(number of domains)"
Example: "NP_054700.1 // span:417-439 // numtm:1"
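A TMM annotation in the form of the example above can be parsed with a regular expression. This sketch handles only the single-span form shown in the example; how several spans are written when "numtm" is greater than 1 is not documented here, so such values are rejected rather than guessed at. The function name `parse_tmm` is hypothetical.

```python
import re

# "<protein ID> // span:<start>-<end> // numtm:<count>"
TMM_RE = re.compile(
    r"^(?P<protein>\S+) // span:(?P<start>\d+)-(?P<end>\d+) // numtm:(?P<count>\d+)$"
)


def parse_tmm(annotation):
    """Parse one TMM annotation into its protein ID, span, and domain count."""
    match = TMM_RE.match(annotation)
    if match is None:
        raise ValueError(f"unrecognized TMM annotation: {annotation!r}")
    return {
        "protein_id": match.group("protein"),
        "span": (int(match.group("start")), int(match.group("end"))),
        "numtm": int(match.group("count")),
    }


tmm = parse_tmm("NP_054700.1 // span:417-439 // numtm:1")
```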

Protein Families

References to entries in the various databases, indicated by namespace prefixes. In each case, our annotations include an accession number and a description. Each description may end with an E-value.

"GPCR:" indicates entries from GPCRDB: Information system for G protein-coupled receptors (GPCRs).

"EC:" indicates an EC Number. Those entries include a symbol, then the EC Number itself, then a description. The first word in the description is a Sw