The NetAffx Analysis Center provides annotations for probe sets on GeneChip® probe arrays. These annotations include (1) static information specific to the probe set composition, (2) sequence annotations extracted from public databases, and (3) protein sequence-level annotations derived from public domain programs, as well as libraries of Hidden Markov Models developed at Affymetrix.
Human, Mouse, and Rat Arrays
For these arrays, static probe set information includes the following:
- Sequence accession and textual description of the representative sequence derived from GenBank. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set.
- UniGene cluster identifier for this gene. The UniGene cluster associated with the representative sequence at the time of product release is provided as the "Archival Reference Group" because the UniGene cluster maintained by NCBI may change or be removed after array design.
- Sub-cluster from which the probe set is derived
- Set of sequence identifiers comprising the sub-cluster. The sequence identifiers in the sub-cluster are called the "cluster members" in the NetAffx Analysis Center.
Consider HG-U95 probe set 32313_at. The representative sequence has the GenBank accession M12125. The GenBank record indicates that the sequence definition is "Human fibroblast muscle-type tropomyosin mRNA, complete cds." The sequence is a member of a sub-cluster of 3 sequences (M75165, M12125, and M74817), which are a subset of UniGene (build #95) cluster Hs.180266. This information never changes; the data is derived at the time of array design and is permanently associated with the probe set.
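The static information for this example can be pictured as an immutable record. The following is a minimal sketch in Python, not the NetAffx data model: the class name StaticProbeSetInfo and its field names are hypothetical, and only the values come from the example above.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical record type; frozen=True mirrors the statement that the
# static information is fixed at array design and never changes.
@dataclass(frozen=True)
class StaticProbeSetInfo:
    probe_set_id: str
    representative_accession: str
    description: str
    archival_unigene_cluster: str  # the "Archival Reference Group"
    unigene_build: int
    cluster_members: Tuple[str, ...]  # sequence identifiers in the sub-cluster


# Values taken from the HG-U95 32313_at example in the text.
record = StaticProbeSetInfo(
    probe_set_id="32313_at",
    representative_accession="M12125",
    description="Human fibroblast muscle-type tropomyosin mRNA, complete cds.",
    archival_unigene_cluster="Hs.180266",
    unigene_build=95,
    cluster_members=("M75165", "M12125", "M74817"),
)

# The representative sequence is itself one of the cluster members.
print(record.representative_accession in record.cluster_members)
```

Because the record is frozen, any attempt to reassign a field raises an error, which is one way to model data that is permanently associated with the probe set.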
Arabidopsis, C. elegans, Drosophila, and S. cerevisiae Arrays
For these arrays, the primary static probe set information is the sequence accession and textual description of the representative sequence, which is derived from GenBank, FlyBase, TAIR, SGD, or WormBase. The representative sequence is chosen during array design as a sequence that is best associated with the transcribed region being interrogated by the probe set.
Consider the C. elegans genome array probe set 193479_at. The representative sequence has the WormBase accession CE06073. The WormBase record indicates that the sequence definition is "homeobox protein (NK-2 subfamily)". This information never changes; the data is derived at the time of array design and is permanently associated with the probe set.
Human, Mouse, and Rat Arrays
For these arrays, the annotations derived from public databases include descriptive and functional annotations of the protein sequence from current NCBI releases of the UniGene, LocusLink, OMIM, and HomoloGene databases. An attempt is made to associate each probe set with the single best UniGene entry and the single best LocusLink entry. The UniGene identifier is determined from the probe set's representative sequence. If the representative sequence is not found in the current UniGene database, then, when possible, the most common UniGene identifier among the sub-cluster sequences is used.
The UniGene title, gene symbol, and cytogenetic bands are also extracted from the UniGene database. Similarly, LocusLink information such as Gene Ontology terms is assigned to the probe set via the LocusLink identifier in the UniGene record. In addition, HomoloGene gives some homolog/ortholog relationships for probe sets on other GeneChip arrays.
Continuing with the probe set 32313_at example, the representative sequence M12125 is found in the current (March, 2002) UniGene cluster Hs.300772. The probe set 32313_at is assigned the identifier Hs.300772 along with the associated title "tropomyosin 2 (beta)" and gene symbol "TPM2" from the UniGene record.
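The assignment rule described above (use the representative sequence, otherwise fall back to the most common identifier among the sub-cluster sequences) can be sketched as follows. This is an illustration under stated assumptions, not the NetAffx implementation: the function name assign_unigene and the dictionary representation of the current UniGene database are hypothetical, the accessions beginning with "ACC_" and the clusters beginning with "Hs.HYPO" are invented placeholders, and the source does not say how ties are broken.

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def assign_unigene(
    representative: str,
    cluster_members: Sequence[str],
    current_unigene: Mapping[str, str],
) -> Optional[str]:
    """Return the UniGene identifier for a probe set, or None if none is found.

    current_unigene maps a sequence accession to its current UniGene cluster
    (a hypothetical stand-in for a lookup in the current UniGene release).
    """
    # Preferred case: the representative sequence is in the current database.
    cluster = current_unigene.get(representative)
    if cluster is not None:
        return cluster

    # Fallback: most common cluster among the sub-cluster sequences.
    votes = Counter(
        current_unigene[member]
        for member in cluster_members
        if member in current_unigene
    )
    if not votes:
        return None
    # Assumption: ties go to the cluster encountered first.
    return votes.most_common(1)[0][0]


# Example from the text: M12125 is found in the current cluster Hs.300772.
current = {"M12125": "Hs.300772"}
print(assign_unigene("M12125", ["M75165", "M12125", "M74817"], current))

# Hypothetical fallback case with placeholder accessions and clusters.
placeholder_db = {"ACC_B": "Hs.HYPO1", "ACC_C": "Hs.HYPO1", "ACC_D": "Hs.HYPO2"}
print(assign_unigene("ACC_A", ["ACC_A", "ACC_B", "ACC_C", "ACC_D"], placeholder_db))
```

Returning None when no sub-cluster sequence can be found reflects the "when possible" wording: in that case the probe set simply has no current UniGene assignment.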
Consider HG-U95 probe set 1010_at, a mitogen-activated protein kinase. The probe set description page has links to UniGene entry Hs.57732, LocusLink entry 5600, SWISS-PROT Q15759, and OMIM report 602898. The GO Classifications from LocusLink include the Biological Process terms GO:7165 (signal transduction) and GO:6950 (stress response), as well as the Molecular Function term GO:4707 (MAP Kinase). The Orthologs/Homologs section, derived from HomoloGene, shows that Drosophila probe sets 143711_at and 154281_at are homologs at 72.0 and 87.0 percent identity, respectively.
Arabidopsis, C. elegans, Drosophila, and S. cerevisiae Arrays
For these arrays, annotations derived from public databases include descriptive and functional annotations of the protein sequence from current releases of FlyBase, TAIR, TIGR, SGD, or WormBase databases. An attempt is made to associate each probe set with one best database entry. The database entry is determined according to the probe set's representative sequence. If the representative sequence is found in the current database, then the annotations of that database entry are associated with the probe set. In addition, HomoloGene gives some ortholog relationships for probe sets on other GeneChip arrays.
Continuing with the probe set 193479_at example, the representative sequence CE06073 is found in the current (April, 2002) WormBase database. The probe set 193479_at is assigned the title "homeobox protein (NK-2 subfamily)" and gene symbol "ceh-28" from the WormBase record. The probe set description page has links to WormBase entry CE06073 and SWISS-PROT Q21169. The GO Classifications (for the WormBase accession CE06073) from Gene Ontology include the Biological Process term GO:6355 (transcription regulation), the Cellular Component term GO:5634 (nucleus), and the Molecular Function term GO:3700 (transcription factor). The InterPro domains IPR000047 (Helix-turn-helix / lambda and other repressors) and IPR001356 (Homeobox domain) were extracted from InterPro using the SWISS-PROT accession Q21169.
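The GO identifiers in the examples above are shown in a shortened form without leading zeros, whereas Gene Ontology releases write identifiers with seven zero-padded digits (for example, GO:0007165 for signal transduction). When cross-referencing these terms against a current GO release, a small conversion may help. The following is a minimal sketch in Python; the helper name normalize_go_id is hypothetical and is not part of the NetAffx Analysis Center.

```python
import re

# Accepts "GO:" followed by one to seven digits.
_GO_PATTERN = re.compile(r"^GO:(\d{1,7})$")


def normalize_go_id(short_id: str) -> str:
    """Convert a shortened GO identifier to the seven-digit zero-padded form."""
    match = _GO_PATTERN.match(short_id.strip())
    if match is None:
        raise ValueError(f"not a GO identifier: {short_id!r}")
    return f"GO:{int(match.group(1)):07d}"


# Identifiers taken from the 1010_at and 193479_at examples in the text.
for short in ["GO:7165", "GO:6950", "GO:4707", "GO:6355", "GO:5634", "GO:3700"]:
    print(short, "->", normalize_go_id(short))
```

Identifiers that are already in the seven-digit form pass through unchanged, so the helper can be applied to mixed input.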
Drosophila, Human, Mouse, Rat, and S. cerevisiae Arrays
For these arrays, Affymetrix provides protein annotations derived by sequence homology using collections of Hidden Markov Models (HMMs) representing well-characterized protein domains. These collections include: (1) Structural Classification of Proteins (SCOP), with models representing all solved 3D protein structures from the Protein Databank (PDB), organized by protein structure and function; (2) Enzyme Classification (EC), with models representing all known enzymes, organized by protein structure and enzymatic reaction as well as substrate identity; and (3) G-protein coupled receptors (GPCR), with highly optimized models representing structurally and functionally related families of this well-characterized class of transmembrane proteins.
Public domain programs (BLOCKS and InterPro) and HMM collections (PFAM) are used to provide domain and motif-level annotations. These annotations include alignments between the motif and target sequence as well as e-value and percent identity scores. Hyperlinks to domain definitions (BLOCKS, PFAM), outside sources (PDB, SCOP, GPCR), and pathways (KEGG and GenMAPP) are provided. BLAST analyses of protein sequences against the GenBank non-redundant database (nr) are also pre-computed.
Consider probe set 1010_at. NetAffx Protein Domains for InterPro, PFAM, BLOCKS, EC, and SCOP are provided, showing a strong match to the MAP kinase domain as well as several other motifs related to signaling, such as "GS motif preceding kinase" and "Kinase associated domain." The two proteins from the NCBI non-redundant protein database (nr) with the highest BLAST scores both include a mitogen-activated protein kinase (gi2316012 and gi4506083).