NetAffx™ Analysis Center

What procedure is used to generate the genome alignment data?

Genome alignments are currently provided for the Human, Mouse, Rat, Drosophila, and C. elegans arrays. We use BLAT to align the target sequences against the genome sequence downloaded from the UCSC website. Some target sequences do not align, perhaps because several of the genomes are still in draft form, while others align at multiple locations on the genome. We therefore apply a filter that selects the best hit for each target sequence, using the following procedure:

  1. Calculate a score for each alignment as follows:
    score = matches - (mismatches + 5*qbaseinsert)
    where matches = number of bases that match (including both repeat and non-repeat regions)
        mismatches = number of bases that do not match in the alignment
        qbaseinsert = number of bases inserted in the query

    Because mismatches and query insertions are subtracted from the match count, some scores can be negative.

    The score is then expressed as a percentage of the target size:

    pcgood = score*100/target size

    The pcgood metric is provided on the web site and in the download files.
  2. Select the alignment with the best score.
  3. Derive genomic coordinates for the probes (25-mers) from the "best" target sequence alignment.
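
The scoring and filtering steps above can be sketched in Python as follows. This is only an illustration of the formulas as stated, not the NetAffx implementation: the Alignment class, its field names, and the example numbers are hypothetical, and the tie-breaking rule (first hit wins when scores are equal) is an assumption, since the procedure does not say how ties are handled.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One BLAT hit for a target sequence (hypothetical field names)."""
    chrom: str
    start: int
    end: int
    matches: int      # bases that match, repeat and non-repeat combined
    mismatches: int   # bases that do not match in the alignment
    qbaseinsert: int  # bases inserted in the query


def score(aln):
    """score = matches - (mismatches + 5*qbaseinsert); may be negative."""
    return aln.matches - (aln.mismatches + 5 * aln.qbaseinsert)


def pcgood(aln, target_size):
    """pcgood = score*100/target size."""
    return score(aln) * 100 / target_size


def best_alignment(alignments):
    """Return the hit with the highest score, or None if nothing aligned.

    Assumption: on equal scores the first hit in the list is kept.
    """
    if not alignments:
        return None
    return max(alignments, key=score)


# Made-up example: a 500-base target with two hits.
hits = [
    Alignment("chr1", 1000, 1500, matches=480, mismatches=10, qbaseinsert=2),
    Alignment("chr7", 200, 700, matches=495, mismatches=3, qbaseinsert=0),
]
best = best_alignment(hits)
print(best.chrom, score(best), pcgood(best, target_size=500))
```

In this example the chr7 hit is selected because the two inserted query bases in the chr1 hit cost ten points on top of its ten mismatches.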

Using the genomic coordinates derived above for each probe (25-mer), we search for transcripts (RefSeq and GenBank mRNA alignments to the genome, taken from the UCSC genome database) that overlap the probe alignments. In the NetAffx summary report, we provide the transcript whose genomic alignment overlaps the largest number of probes from that particular probe set. We also provide the total number of probes from the probe set that overlap the transcript, as a measure of the ability of the probe set to detect the corresponding transcript.
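
The transcript selection can be sketched along the same lines. The code below is a minimal illustration under stated assumptions, not the NetAffx implementation: intervals are treated as half-open [start, end), a transcript is modelled as a chromosome plus a list of aligned blocks, a probe counts as overlapping if it shares at least one base with any block, strand is ignored, and ties go to the first transcript seen. The function names and the transcript identifiers are hypothetical.

```python
def count_overlapping_probes(probes, transcript_chrom, blocks):
    """Count probes that share at least one base with any transcript block.

    probes: list of (chrom, start, end) tuples, half-open coordinates.
    blocks: list of (start, end) tuples for the transcript's aligned blocks.
    """
    count = 0
    for chrom, start, end in probes:
        if chrom != transcript_chrom:
            continue
        if any(start < b_end and b_start < end for b_start, b_end in blocks):
            count += 1
    return count


def best_transcript(probes, transcripts):
    """Return (name, probe_count) for the transcript overlapping most probes.

    transcripts: dict mapping a name to (chrom, blocks).
    Returns (None, 0) when no transcript overlaps any probe.
    """
    best_name, best_count = None, 0
    for name, (chrom, blocks) in transcripts.items():
        n = count_overlapping_probes(probes, chrom, blocks)
        if n > best_count:  # strict '>' keeps the first transcript on ties
            best_name, best_count = name, n
    return best_name, best_count


# Made-up example: three 25-mer probes and two candidate transcripts.
probes = [("chr1", 100, 125), ("chr1", 150, 175), ("chr1", 400, 425)]
transcripts = {
    "transcript_A": ("chr1", [(90, 200)]),
    "transcript_B": ("chr1", [(90, 130), (140, 180), (390, 430)]),
}
name, n_probes = best_transcript(probes, transcripts)
print(name, n_probes)
```

Here transcript_B is reported with a count of 3, because all three probes fall within its blocks, whereas transcript_A covers only the first two.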