README for Human Exon Array CSV annotation files.

Copyright 2005-2006, Affymetrix Inc. All Rights Reserved.

The contents of this CSV file are covered by the terms of use or
license located at http://www.affymetrix.com/site/terms.affx

Array name:       HuEx-1_0-st
Organism:         Homo sapiens
Genome assembly:  NCBI build 36, UCSC hg18, Mar 2006
NetAffx version:  20 (July 2006)
CSV version:      1.7

This README provides a guide to the contents of the CSV files
containing annotations for the Affymetrix Human Exon Array probe sets
and transcript clusters. These files contain both design-time
information and NetAffx assignments to public mRNAs, and are suitable
for use as input to the ExACT software.

Contents
--------
I.   General Notes
       A. File format conventions
       B. mRNA Assignment pipeline
            1. Public mRNA data repository versions
            2. Assignment Statistics
            3. Assignment Scoring
II.  Probe Set CSV file: HuEx-1_0-st-probeset-annot.csv
       A. Content Description
       B. Column Descriptions
III. Transcript CSV file: HuEx-1_0-st-transcript-annot.csv
       A. Content Description
       B. Column Descriptions
IV.  CSV File Version History


I. General Notes
-----------------

A. File format conventions

The CSV files follow the conventions of other NetAffx tabular data
files as described at:

    http://www.affymetrix.com/support/technical/manual/taf_manual.affx

CSV files for exon arrays are a bit more complex than for the
traditional 3'-IVT type Affymetrix expression arrays described at the
above URL, owing to additional design-time information and increased
multiplicity of public mRNA assignments per probe set and per
transcript cluster.

A given column within the exon array CSV file may contain
single-valued or multi-valued data. In general, the design-time data
is single-valued while the mRNA assignment information is
multi-valued, because multiple public mRNAs may be assigned to a given
probe set or transcript cluster.
Furthermore, a given mRNA may contain multiple annotations of a given
type (e.g., alternative gene names, GO biological process IDs, Pfam
domains, etc.). Finally, each annotation may contain more than one
type of data (e.g., each associated gene contains gene symbol,
cytogenetic location, description, Entrez Gene ID).

This set of hierarchical, many-to-one relationships is encoded in the
CSV files using the following conventions. Within each column value,
two types of field delimiters may appear:

  * " /// " - separates annotations for different mRNAs and also
              separates multiple annotations of a particular type for
              a given mRNA. The mRNA to which the annotation applies
              is identified by the first data type sub-field of the
              annotation.

  * " // "  - separates different data type sub-fields for a single
              mRNA annotation

Note that the spaces surrounding the forward slashes are considered
part of the delimiter. An empty value (null) is indicated by three
hyphens "---".

Here's an example to illustrate: The probe set ID 3917874 has been
assigned to several public mRNAs. Two of these are described in the
mrna_assignment column of the probe set CSV file as follows:

                 Probe set mrna_assignment column value
     _______________________________|_____________________________
    |                                                             |
    "NM_000454 // chr21 // 100 /// ENST00000270142 // chr21 // 100"
     |_______|    |___|    |_|     |_____________|    |___|    |_|
         |          |       |             |             |       |
     accession    chrm    score       accession       chrm    score
    |_________________________|    |______________________________|
                 |                                |
         assigned mRNA #1                 assigned mRNA #2

In both the probe set and transcript cluster CSV files, the
mrna_assignment column lists all assigned public mRNAs, with their
accessions as the first data type sub-field. Other columns containing
assigned mRNA annotations will contain data for only those mRNAs for
which that data are available. The first data type sub-field of these
other columns will contain the accession of the assigned mRNA to which
the data for that column apply.
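As a rough illustration of the delimiter conventions above, the
following Python sketch splits one multi-valued column value into a
list of annotations, each a list of sub-fields. The function name
parse_multipart is hypothetical and not part of any Affymetrix
software, and mapping a "---" sub-field to None is an assumption made
for convenience. The input is the abbreviated three-sub-field example
string shown above.

```python
NULL = "---"  # empty (null) marker described in this README


def parse_multipart(value):
    """Split a multi-valued column value into a list of sub-field lists.

    Hypothetical helper: " /// " separates annotations and " // "
    separates the sub-fields within one annotation. The surrounding
    spaces are part of each delimiter.
    """
    if value.strip() == NULL:
        return []  # the whole column is null
    annotations = []
    for annotation in value.split(" /// "):
        # Assumption: a "---" sub-field is treated as a missing value.
        subfields = [None if s == NULL else s
                     for s in annotation.split(" // ")]
        annotations.append(subfields)
    return annotations


example = "NM_000454 // chr21 // 100 /// ENST00000270142 // chr21 // 100"
for accession, seqname, score in parse_multipart(example):
    print(accession, seqname, score)
```

The first element of each inner list is the mRNA accession, so it can
serve as the key when matching annotations from different columns.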
For example, transcript cluster ID 3917851 (corresponding to the probe
set in the example above) has eight assigned mRNAs, but only one of
them is associated with a known gene. So the mrna_assignment column
for this entry will have data for eight different mRNAs, delimited by
seven " /// ", and the gene_assignment column will have just one mRNA
sub-field. The first data type sub-field of the mRNA in the
gene_assignment column contains the accession of the mRNA, listed in
the mrna_assignment column, to which the gene assignment data applies.

B. mRNA Assignment Pipeline

Assignments to public mRNA were generated by the NetAffx annotation
sequence analysis pipeline. Public mRNAs used for assignments came
from Genbank, RefSeq, or Ensembl. The public mRNAs used to annotate
the Exon Array are the same set that was used to annotate all other
Affymetrix expression arrays available via NetAffx for the quarter
listed above.

Multiple mRNA assignments for a given probe set or transcript cluster
are ranked according to data source. Multiple mRNAs are ordered in the
mRNA assignment data columns as follows:

    RefSeq, EnsemblTranscript, Genbank, EnsemblEST, EnsemblPrediction

Also note that only sense strand mRNA hits are reported.

The majority of probe sets and transcript clusters do not have public
mRNA assignments, since many are based on single-EST or gene predictor
evidence that is not included in the set of public mRNAs used for
NetAffx mRNA assignments.

Details of the analysis pipeline along with additional statistics will
be described in a separate document.

  1. Public mRNA data repository versions

     Data Source   Date         Version
     -----------   ----------   --------
     Genbank       2006-04-15   153
     RefSeq        2006-05-01   17
     Ensembl       2006-04-01   38
     UniGene-Hs    2006-05-07   191

  2. Assignment Statistics

     Probe Sets
     --------------------------------------------------------------
     Level   Total     Assigned         Unassigned       Cross Hyb
     ------  --------  ---------------  ---------------  -------------
     all     1404664   592000 (42.1%)   812664 (57.9%)   21964 (3.71%)
     core     284258   281998 (99.2%)     2260 (0.8%)     6074 (2.15%)
     extnd    519801   157962 (30.4%)   361839 (69.6%)   12087 (7.65%)
     full     577206   143642 (24.9%)   433564 (75.1%)    3418 (2.38%)

     Transcript clusters
     -----------------------------------------------------
     Total     Assigned        Unassigned       Cross Hyb
     -------   -------------   --------------   ------------
     312367    73715 (23.6%)   238652 (76.4%)   6376 (8.65%)

  3. Assignment Scoring

The scoring strategy, and the public mRNA assignment pipeline in
general, takes advantage of the genome-based nature of the Exon Array
design: Every probe, probe set, exon cluster, and transcript cluster
has a unique genomic location.

Each assignment between a public mRNA and a transcript cluster or
probe set is assigned a score based on the number of probes that could
be aligned directly to the mRNA relative to the number that could
potentially have aligned. When a given mRNA is aligned to the genome,
it may span only a portion of the probe set or transcript cluster. To
get the number of probes that could potentially align to the mRNA, we
consider only those probes whose genomic locations are within the
region of the mRNA as it aligns to the genome. The maximum assignment
score is 100, indicating all possible probes align to the mRNA. An
assignment score is not given in the case where the mRNA could not be
aligned to the genome.

Assignments for transcript clusters are also given a coverage score
which is a rough measure of how much of the transcript cluster is
covered by the mRNA in a genomic alignment. Coverage is computed by
taking the total number of probes that could potentially align to the
mRNA based on the mRNA-to-genome alignment, relative to the total
number of probes in the transcript cluster.
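The two scores described above can be sketched in Python as follows,
for an mRNA that aligns to the genome. The function names and the
probe counts in the usage lines are hypothetical, chosen only to make
the arithmetic concrete. Any rounding applied to the values stored in
the CSV files is not described here and is not reproduced.

```python
def assignment_score(direct_probes, possible_probes):
    """(direct probes / possible probes) * 100.

    Returns None when the number of possible probes is unknown or
    zero, mirroring the case where no assignment score is given.
    """
    if not possible_probes:
        return None
    return 100.0 * direct_probes / possible_probes


def coverage_score(possible_probes, total_probes):
    """(possible probes / total probes in transcript cluster) * 100."""
    return 100.0 * possible_probes / total_probes


# Hypothetical counts: 8 of 10 candidate probes align directly, and the
# transcript cluster contains 40 probes in total.
print(assignment_score(8, 10))  # 80.0
print(coverage_score(10, 40))   # 25.0
```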
The maximum coverage score is 100, indicating that all probes in the
transcript cluster could potentially align to the mRNA.

So, to summarize:

                         # probes directly aligned
    assignment score = ---------------------------- x 100
                        # probes potentially aligned

                        # probes potentially aligned
    coverage score   = ---------------------------- x 100
                           # probes total in TC

For mRNAs that could not be aligned to the genome, the coverage score
is determined by taking the number of probes that align directly to
the mRNA, relative to the total number of probes in the transcript
cluster. Thus, the coverage score for these mRNAs is really a hybrid
between the assignment score and the coverage score for mRNAs that can
be aligned to the genome.

Weak mRNA assignments that are likely due to cross-hybridization are
assessed by examining the assignment score or coverage score. If the
assignment score is less than 50, the assignment is flagged as a
cross-hyb. For mRNAs with no genomic alignment (and hence no
assignment score), the coverage score is used instead. This cross-hyb
assessment is first applied at the transcript cluster level and then
propagated down to the constituent exon cluster and probe set levels
for all exon clusters and probe sets within this transcript cluster
that align to the mRNA in question.

All alignments used BLAT software v.29 (http://www.genomeblat.com).
For alignments involving probes, all 25 bases were required to align
with no mismatches.


II. Probe Set CSV file: HuEx-1_0-st-probeset-annot.csv
------------------------------------------------------

A. Content Description

Each line contains information for a single probe set ID. The
information consists of both design-time data, such as the genomic
location, and mRNA assignments for any public mRNAs that should be
detected by this probe set based on computational sequence alignment
analysis. Only limited details are provided for each assigned mRNA.
For more information about these mRNAs, see the corresponding entry in
the transcript cluster CSV file (use the probe set's transcript
cluster ID to join).

Lines are sorted in ascending order by probe set ID, which corresponds
closely with genomic location.

B. Column Descriptions

 1. probeset_id (integer)
      Unique identifier for the probe set.

 2. seqname
 3. strand (+|-)
 4. start (integer)
 5. stop (integer)
      Columns 2-5 contain the name and genomic location of the
      sequence from which the probe set was designed, on the assembly
      indicated at the top of this file. Coordinates are standard
      1-based (length=stop-start+1).

 6. probe_count (integer)
      Total number of probes in this probe set.

 7. transcript_cluster_id (integer)
      Unique identifier for the transcript cluster containing this
      probe set.

 8. exon_id (integer)
      Unique identifier for the exon cluster containing this probe
      set.

 9. psr_id (integer)
      Unique identifier for the probe selection region containing this
      probe set.

10. gene_assignment (multipart)
      Gene name(s) for each assigned mRNA, for mRNAs that correspond
      to known genes.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. gene symbol - gene name if mRNA corresponds to a known gene

11. mrna_assignment (multipart)
      Public mRNAs that should be detected by this probe set based on
      sequence alignment.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. assignment seqname - name of the genomic sequence to which
           the public mRNA assigned to this probe set aligns. Could
           potentially be different from the design-time seqname
           (column #2) if probes in this probe set match an mRNA
           sequence which aligns to a highly similar sequence on
           different chromosomes. The assignment seqname will be "na"
           in cases where the mRNA could not be aligned to the genome.
        c. assignment score - (direct probes/possible probes) * 100.
           A null assignment score occurs for mRNAs that could not be
           aligned to the genome, yet could be aligned to probes.
        d. direct probes - number of probes in this probe set that
           align directly and completely to this mRNA. Used for
           computing assignment score.
        e. possible probes - number of probes within this probe set
           whose genomic location lies within the mRNA's genomic
           alignment. Used for computing assignment score. Also used
           for computing the assignment coverage (see Transcript CSV
           annotation section). A null value occurs for mRNAs that
           could not be aligned to the genome.
        f. xhyb - boolean indicator for whether this assignment is
           considered a cross-hybridization. See above for cross-hyb
           calling criteria.

12. probeset_type (integer)
      Cross-hybridization type of the probe set, predicted based on
      computational sequence alignment. Possible values:
        1 = unique  - All probes in probe set hybridize to a single
                      genomic position
        2 = similar - All probes in probe set hybridize to multiple
                      genomic positions; all probes in the probe set
                      still hybridize to the same set of transcribed
                      genomic regions.
        3 = mixed   - There are inconsistent cross-hybridization
                      properties amongst the probes in the probe set;
                      the set of hybridizing genomic positions can
                      vary among the probes.
      ** Note that these definitions for probeset type are relative
         to exon array designs and may not be true for other types of
         designs.

13. number_independent_probes (integer)
      Number of probes within probe set that overlap no more than 13
      bases with another probe in the set. (Note that two probes that
      overlap more than 13 bases are counted as one.)

14. number_cross_hyb_probes (integer)
      Number of probes within probe set that align to more than one
      genomic location.

15. number_nonoverlapping_probes (integer)
      Number of probes within probe set that do not overlap another
      probe. (Note that a set of overlapping probes will be counted
      as 1.)

16. level
      Level of design-time annotation support for the probe set.
      Possible values:
        Core - Supported by RefSeq, putative full-length mRNA, and
               Vega.
        Extended - Other cDNA support (ESTs, partial mRNAs, syntenic
               mouse and rat mRNAs) and Ensembl support.
        Full - Supported by gene predictions only (i.e. GENSCAN, sgp,
               geneid, exoniphy).
        Free - Designed against annotations which were merged such
               that no single annotation (or evidence) contains the
               probeset.
        Ambiguous - Cannot be unambiguously assigned to a particular
               transcript cluster.

17. bounded (boolean)
      Probesets are grouped into transcript clusters based on spliced
      annotations which share splice sites and single exons which have
      overlapping exonic sequence. Remaining single exon content may
      be grouped into a transcript cluster if it is bounded by it
      (i.e. falls within an intron). This flag indicates such
      inclusion by bounding. Bounded probesets have lower confidence
      of correct transcript cluster placement.

18. NoBoundedEvidence (boolean)
      Indicates probesets which are not fully contained within any
      single annotation. For example, two EST sequences aligned to
      the genome may infer the following single exon transcribed
      region:

      ==================================================  Genome
            ----------------                              EST 1
                      -----------------                   EST 2

      The result may have been a probe selection region (PSR) and
      probeset as follows:

            ---------------------------                   PSR
               --    --    --    --                       Probes

      The result is that neither EST fully contains this probeset.
      Hence no annotations (i.e. ESTs in this case) are associated
      with the probeset.

19. has_cds (boolean)
      Indicates whether or not the probe set falls within a coding
      sequence of a design-time annotation.

20. fl: putative full-length mRNA from RefSeq and GenBank (integer)
21. mrna: putative partial mRNAs from GenBank (integer)
22. est: expressed sequence tags (integer)
23. vegaGene: public gene set (integer)
24. vegaPseudoGene: public gene set (integer)
25. ensGene: public gene set (integer)
26. sgpGene: public gene set from ab initio gene predictor (integer)
27. exoniphy: public exon set from ab initio exon predictor (integer)
28. twinscan: public gene set from ab initio gene predictor (integer)
29. geneid: public gene set from ab initio gene predictor (integer)
30. genscan: public gene set from ab initio gene predictor (integer)
31. genscanSubopt: public exon set from ab initio gene predictor
      (integer)
32. mouse_fl: putative full-length mRNAs from RefSeq and GenBank
      aligned to the mouse genome and mapped onto human genome using
      genome synteny maps (integer)
33. mouse_mrna: putative partial mRNAs from RefSeq and GenBank aligned
      to the mouse genome and mapped onto human genome using genome
      synteny maps (integer)
34. rat_fl: putative full-length mRNAs from RefSeq and GenBank aligned
      to the rat genome and mapped onto human genome using genome
      synteny maps (integer)
35. rat_mrna: putative partial mRNAs from RefSeq and GenBank aligned
      to the rat genome and mapped onto human genome using genome
      synteny maps (integer)
36. microRNAregistry: public micro RNA data set (integer)
37. rnaGene: public set of structural RNA genes (integer)
38. mitomap: public set of mitochondrial annotations (integer)
      Columns 20-38 indicate the number of annotations in each of
      these design-time categories that were used as supporting
      evidence for the creation of the probe set.


III. Transcript CSV file: HuEx-1_0-st-transcript-annot.csv
----------------------------------------------------------

A. Content Description

Each line contains information for a single transcript cluster ID.
The information consists of design-time data, mRNA assignments for any
public mRNAs that should be detected by the probe sets within the
transcript cluster based on computational sequence alignment analysis,
and functional information about those public mRNAs, if available.

Lines are sorted in ascending order by transcript cluster ID, which
corresponds closely with genomic location.

B. Column Descriptions

 1. transcript_cluster_id (integer)
      Unique identifier for the transcript cluster.

 2. probeset_id (integer)
      Identical to the transcript cluster id.
      Required for compatibility with ExACT software.

 3. seqname
 4. strand (+|-)
 5. start (integer)
 6. stop (integer)
      Columns 3-6 contain the name and genomic location of the
      sequence from which the transcript cluster was designed, on the
      assembly indicated at the top of this file. Coordinates are
      standard 1-based (length=stop-start+1).

 7. total_probes (integer)
      Total number of probes contained by this transcript cluster.

 8. gene_assignment (multipart)
      Gene information for each assigned mRNA, for mRNAs that
      correspond to known genes.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. gene symbol - gene name if mRNA corresponds to a known gene
        c. gene title - description of gene product
        d. cytoband - cytogenetic location of gene
        e. entrez gene id - Entrez Gene database identifier

 9. mrna_assignment (multipart)
      Description of the public mRNAs that should be detected by the
      probe sets within this transcript cluster based on sequence
      alignment.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. source_name - Name of public repository containing the
           mRNA (Genbank, RefSeq, or Ensembl)
        c. description - description of mRNA from source repository
        d. assignment seqname - name of the genomic sequence to which
           the public mRNA assigned to this transcript cluster aligns.
           Could potentially be different from the design-time seqname
           (column #3) if probes in this transcript cluster match an
           mRNA sequence which aligns to a highly similar sequence on
           different chromosomes. The assignment seqname will be "na"
           in cases where the mRNA could not be aligned to the genome.
        e. assignment score - (direct probes/possible probes) * 100.
           A null assignment score occurs for mRNAs that could not be
           aligned to the genome, yet could be aligned to probes.
        f. assignment coverage - (possible probes/total probes) * 100.
           A zero value can occur for large transcript clusters where
           the number of possible probes is less than 1% of the total.
        g. direct probes - number of probes in this transcript cluster
           that align directly and completely to this mRNA. Used for
           computing assignment score.
        h. possible probes - number of probes within this transcript
           cluster whose genomic location lies within the mRNA's
           genomic alignment. Used for computing assignment score.
           Also used for computing the assignment coverage. A null
           value occurs for mRNAs that could not be aligned to the
           genome.
        i. xhyb - boolean indicator for whether this assignment is
           considered a cross-hybridization. See above for cross-hyb
           calling criteria.

10. SwissProt
      SwissProt information for each assigned mRNA that can be mapped
      to one or more SwissProt records. Note that the Swiss-Prot
      accession may not be associated with the actual mRNA accession
      listed but rather with an mRNA that has been clustered with the
      given mRNA by the NetAffx annotation pipeline.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. SwissProt accession - Swiss-Prot accession number

11. UniGene
      UniGene information for each assigned mRNA that can be mapped
      to one or more UniGene records.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. UniGene id - UniGene identifier for associated UniGene
           cluster
        c. UniGene expr - UniGene tissue expression information

12. GO_biological_process (multipart)
13. GO_cellular_component (multipart)
14. GO_molecular_function (multipart)
      Columns 12-14 contain annotations from the three branches of
      the Gene Ontology for mRNAs that have GO annotations.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. GO id - GO identifier for the GO term
        c. GO term - GO term text
        d. GO evidence - GO evidence code description (not code)

15. pathway (multipart)
      GenMAPP pathway information for assigned mRNAs that can be
      annotated with GenMAPP data.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. source - name of the data source
        c. pathway name - name of the pathway

16. protein_domains (multipart)
      Pfam and SCOP protein domain information for mRNAs that can be
      annotated by these data sources.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. source - name of the data source
        c. accession or domain name - accession (Pfam) or name of the
           domain (SCOP)
        d. domain description - description of the domain

17. protein_families (multipart)
      Enzyme Classification and Hank's Kinase protein family
      association for mRNAs that can be annotated by these data
      sources.
      Sub-fields:
        a. accession - public sequence identifier for mRNA
        b. source - name of the data source
        c. family accession - accession number for family
        d. family description - description of the family


IV. CSV File Version History
----------------------------

The version numbers listed here refer to the header line labeled
"#%netaffx-annotation-csv-version" in the CSV files.

Version 1.7, 2006-07-20
  * Updated assignments based on NetAffx annotation release 20
    (July 2006).
  * Changed formatting of assignment statistics section: Now
    tab-delimited.

Version 1.6, 2006-04-19
  * Updated assignments based on NetAffx annotation release 19
    (April 2006).
  * Added more assignment statistics: Probe set statistics are now
    broken out by evidence level (core, extended, full) and the
    number of likely cross-hybridization assignments is indicated
    for probe sets and transcript clusters.

Version 1.5, 2006-02-16
  * Fixed bug in probe set CSV file. Data fields for probe sets not
    assigned to any public mRNA were missing or incorrect (e.g.,
    columns 2 and 3 contained the genomic start and stop instead of
    seqname and strand as documented here). Data fields for probe
    sets assigned to public mRNAs were not affected.
  * Fixed bug in transcript cluster CSV file.
    A very small fraction of rows contained an internal newline
    character in the protein_domains field, leading to rows with
    incomplete data and additional bogus rows.
  * Changes to readme: Added copyright notice, removed approximate
    filesizes, minor re-wording for consistency.

Version 1.4, 2006-01-12
  * Based on updated design-time annotations (as described in the
    mid-January announcement for this array, see
    http://www.affymetrix.com/analysis).
  * Updated assignments based on NetAffx 2005 Q4 annotation update.
  * Added probeset_id column to transcript cluster CSV file. Contains
    transcript_cluster_id, not exon level probe set id, and is
    required for compatibility with ExACT software.
  * Fixed error in the total_probes and the assignment coverage
    statistics for transcript clusters.
  * Added missing Swiss-Prot accession cross references for a
    significant fraction of transcript clusters with mRNA
    assignments.
  * Changed ranking of assigned mRNAs by data source, which affects
    the order in which data for different assigned mRNAs are listed
    for both transcripts and probe sets that have multiple mRNA
    assignments.
      RefSeq, Ensembl-Transcript, GenBank, Ensembl-EST,
      Ensembl-Prediction
  * Fixed ordering of data for multiple assigned mRNAs for the other
    transcript cluster annotation fields to be consistent with the
    ordering in the mrna_assignment and gene_assignment columns.
  * Added more detail to certain column descriptions in the README.

Version 1.3, 2005-09-20
  * Initial public release
  * Updated assignments based on NetAffx 2005 Q3 annotation update.
  * Added new subfields to mrna_assignment:
      - actual probes
      - possible probes
      - xhyb
  * Added documentation about assignment scoring.
  * Small change in ordering of probeset columns.
  * Fixed minor bug in probe set data (psr_id and assignment seqname
    were swapped in the last row).

Version 1.2, 2005-08-21
  * Early Access release
  * Posted to allexon-beta website
  * Contains all probe sets and transcript clusters, both assigned
    and unassigned (to public mRNAs)
  * Includes some missing mRNA assignments
  * Column order changes pertaining to both probe set and transcript
    cluster CSV files:
      - Removed columns "assign_seqname" and "design_seqname" in
        favor of just one "seqname" column corresponding to
        design_seqname in the previous release.
      - Added column gene_assignment as 7th col, for gene-related
        information per assigned mRNA.
      - Moved mrna_assignment column from #2 to #8
      - Ranked assigned mRNAs according to data source:
        RefSeq, GenBank, Ensembl, EST and gene-prediction based
      - Suppressed unnecessary output of null assignment data.
  * Column order changes specific to probe set CSV file:
      - mrna_assignment subfields are now:
        accession // assign_seqname // assign_score
  * Column order changes specific to transcript cluster CSV file:
      - mrna_assignment subfields are now:
        accession // source // description // assign_seqname //
        assign_score // assign_coverage
      - Moved UniGene and SwissProt data for assigned mRNAs out of
        the mrna_assignment column and into separate columns (9 and
        10, respectively) to capture potential many-to-one
        relationships.

Version 1.1, 2005-08-14
  * Early Access version 2 release (EAv2)
  * Used for BioConductor workshop; not posted to allexon-beta
    website
  * Contains all probe sets and transcript clusters, both assigned
    and unassigned (to public mRNAs)
  * Has columns "assign_seqname" and "design_seqname"

Version 1.0, 2005-06-17
  * Early Access version 1 release (EAv1)
  * Posted to allexon-beta website
  * Contains only probe sets and transcript clusters that were
    assigned to public mRNAs
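
Because the version numbers in this history refer to the
"#%netaffx-annotation-csv-version" header line, a script can check
which CSV version it has been given before parsing. The Python sketch
below is only an illustration: the function name read_csv_version is
hypothetical, and the "key=value" layout of the header line and the
sample text are assumptions, not taken from this README.

```python
import io


def read_csv_version(lines, key="netaffx-annotation-csv-version"):
    """Return the CSV version string from the leading header lines.

    Assumption: header lines start with "#" and the version line has
    the form "#%<key>=<value>". Returns None if no such line is found.
    """
    prefix = "#%" + key
    for line in lines:
        if not line.startswith("#"):
            break  # header block ends at the first non-comment line
        if line.startswith(prefix):
            rest = line[len(prefix):].strip()
            return rest.lstrip("=: ").strip() or None
    return None


# Hypothetical sample input standing in for the top of a CSV file.
sample = io.StringIO(
    '#%netaffx-annotation-csv-version=1.7\n'
    '"probeset_id","seqname"\n'
)
print(read_csv_version(sample))
```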