This data is distributed in conjunction with the publication "The
Effects of Alternative Splicing on Transmembrane Proteins in the Mouse
Genome", presented at PSB 2004. The data describes how
alternative splicing alters transmembrane and signal peptide protein
motifs in 8067 mouse proteins. This is a nonredundant set derived
from all mouse cDNAs with reasonable genomic alignments.
As detailed in the paper, these proteins were grouped dynamically by
gene and splice variant according to the genomic alignment of the
associated cDNA sequence. The proteins actually assessed were
derived from the genomic translation of the GenBank CDS regions.
This was a deliberate measure to factor out the effects of genetic
variation. Thus, in some cases, the protein analyzed may differ
from the protein in the GenBank record. All proteins were run
through TMHMM and SignalP to identify
putative transmembrane and signal peptide motifs, respectively.
The paper assesses how these protein annotations varied between
different splice variants of the same gene, and analyzes the relation
between these motifs and splice sites.
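
As a rough illustration of the comparison described above, the sketch
below groups isoforms by gene and reports the genes whose isoforms do not
all share the same motif profile. The tuple layout, the function name, and
the choice to compare only the number of predicted transmembrane helices
and the presence of a signal peptide are assumptions made for this
example; the criterion used in the paper may differ.

```python
from collections import defaultdict


def genes_with_differing_motifs(isoforms):
    """Return the sorted IDs of genes whose isoforms differ in motif profile.

    isoforms: iterable of (gene_id, n_tm_helices, has_signal_peptide)
    tuples, one per protein isoform (hypothetical layout).
    """
    profiles_by_gene = defaultdict(set)
    for gene_id, n_tm_helices, has_signal_peptide in isoforms:
        profiles_by_gene[gene_id].add((n_tm_helices, has_signal_peptide))
    # A gene "differs" when its isoforms yield more than one distinct profile.
    return sorted(g for g, profiles in profiles_by_gene.items()
                  if len(profiles) > 1)


# Toy input, not taken from the data set.
toy_isoforms = [
    ("geneA", 7, False),
    ("geneA", 5, False),  # loses two helices relative to the first isoform
    ("geneB", 1, True),
    ("geneB", 1, True),   # same profile as the first isoform
    ("geneC", 0, False),  # single isoform, nothing to compare
]
differing = genes_with_differing_motifs(toy_isoforms)
```

With the toy input, only geneA is reported, since its two isoforms have
different helix counts.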
This data is available in RDF and N3 formats, and is
organized as follows. The files all_genes.1.rdf to
all_genes.14.rdf contain the data on all the 8067 sequences and 6847
genes analyzed. Due to the size of this data, it was divided into
fourteen files by gene. The files multi_variant_genes.1.rdf
to multi_variant_genes.4.rdf contain a subset of that data: the 904
genes with multiple protein isoforms, with a total of 2118 sequences.
Finally, the file genes_with_differing_annotations.rdf reports on
the 138 genes for which the motifs differed between the protein
isoforms. These genes were associated with 396 sequences.
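
The file layout above can be enumerated with a short script, which may be
convenient when looping over the whole distribution. This is a minimal
sketch: the function name is hypothetical, and only the .rdf names given
above are generated, since the names of the N3 files are not stated here.

```python
def dataset_files():
    """List the RDF file names described above, grouped by data set."""
    return {
        # All 8067 sequences and 6847 genes, split into fourteen files.
        "all_genes": [f"all_genes.{i}.rdf" for i in range(1, 15)],
        # The 904 genes with multiple protein isoforms, in four files.
        "multi_variant_genes": [f"multi_variant_genes.{i}.rdf"
                                for i in range(1, 5)],
        # The 138 genes whose motifs differ between isoforms, in one file.
        "genes_with_differing_annotations": [
            "genes_with_differing_annotations.rdf"
        ],
    }


files = dataset_files()
total_files = sum(len(names) for names in files.values())
```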
In this data, each splice variant is represented by one transcript.
The properties associated with each transcript include its protein
translation, its set of exons, and its set of annotations. The
exons are described by: genomic start and stop coordinates; whether
they're part of a CDS region; the CDS start and stop, if they are
not the same as the exon start and stop; and the translation frame, if
any. Each annotation is described by its set of protein spans:
ungapped motifs in the protein sequence. Each protein span is
associated with one or more genomic spans, ungapped regions of genomic
alignment. If the genomic coordinates of a protein span are not
divided by an intron, then the protein span will have one associated
genomic span. Finally, each exon and each annotation is
flagged according to whether it is common to all variants of the gene.
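
The relation between exons, protein spans, and genomic spans described
above can be sketched as follows. This is an illustration, not the schema
or the code used to build the data: the class and field names are
hypothetical, and the sketch assumes forward-strand transcripts, 1-based
inclusive coordinates, exons listed in transcript order, and a CDS that
begins at the first codon position.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Exon:
    start: int                       # genomic start of the exon
    stop: int                        # genomic stop of the exon
    in_cds: bool                     # whether the exon is part of a CDS region
    cds_start: Optional[int] = None  # set only if it differs from start
    cds_stop: Optional[int] = None   # set only if it differs from stop
    frame: Optional[int] = None      # translation frame, if any
    common_to_all_variants: bool = False


@dataclass
class GenomicSpan:
    start: int
    stop: int


def coding_segments(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the coding part of each CDS exon as (start, stop)."""
    segments = []
    for exon in exons:
        if exon.in_cds:
            lo = exon.start if exon.cds_start is None else exon.cds_start
            hi = exon.stop if exon.cds_stop is None else exon.cds_stop
            segments.append((lo, hi))
    return segments


def genomic_spans(exons: List[Exon], aa_start: int,
                  aa_stop: int) -> List[GenomicSpan]:
    """Map a protein span (residues aa_start..aa_stop) to genomic spans."""
    nt_start = 3 * (aa_start - 1) + 1  # first coding nucleotide of the span
    nt_stop = 3 * aa_stop              # last coding nucleotide of the span
    spans = []
    consumed = 0  # coding nucleotides in the segments already visited
    for seg_start, seg_stop in coding_segments(exons):
        seg_len = seg_stop - seg_start + 1
        lo = max(nt_start, consumed + 1)
        hi = min(nt_stop, consumed + seg_len)
        if lo <= hi:  # the protein span overlaps this coding segment
            spans.append(GenomicSpan(seg_start + lo - consumed - 1,
                                     seg_start + hi - consumed - 1))
        consumed += seg_len
    return spans


# Toy transcript: the CDS starts inside exon 1; exon 3 is noncoding.
exons = [Exon(50, 130, True, cds_start=101),
         Exon(201, 260, True),
         Exon(300, 350, False)]
within_exon = genomic_spans(exons, 5, 8)     # lies inside the first exon
across_intron = genomic_spans(exons, 8, 12)  # straddles the intron
```

In this toy transcript, the first protein span lies within a single exon
and maps to one genomic span, while the second is divided by the intron
and maps to two.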
For more information, please contact the authors of the paper.