INTRODUCTION
The search for homologous substrings in a DNA or protein database is arguably the most important problem for bio-informatics (Meidanis and Setubal, 1997). It is a necessary step in the reconstruction of large genomes from the raw data generated by DNA sequencing equipment, and it is the main tool for identifying the function and evolutionary history of genes.
Homology search can be formally modeled as follows: the databank is a set of sequences over some alphabet S, and the goal is to find the substrings of those sequences that match a given string, within a given similarity criterion. The expected number of substrings that match by a random query s in a random databank is proportional to the number, N, of databank substrings times the probability p that the query s matches one of those sequences. The value of p depends on the statistical distributions of the queries, of the databank strings, and of the similarity criterion.
Useful information content
The match probability p can be expressed by the Equation I = -log2 p, which may be called the useful information content of the query string; it is expressed as a number of binary digits or bits. To justify this nomenclature, consider the abstract examples below. In all cases, both the query s and the databank entries are random strings of n = 20 binary digits, with the same distribution.
A. The string bits are uniformly and independently drawn from {0,1}, and two strings are considered similar if and only if their last 15 bits are identical. Clearly, the match probability p is 2-15, which implies I = 15; this is precisely the number of bits in the query that are useful for the search.
B. The string bits are uniformly and independently drawn, and two strings are considered similar if they differ by at most a single bit at any position. Then the match probability is p = 2-n + n 2-(n-1) = 41 · 2-20, which means I = 20 - log241 » 14.64. That is, because of the tolerated errors, the 20-bit query string only determines about 14.64 bits of the matching entries.
C. The string bits are independently drawn, with probabilities p0 = 1/3 for 0 and p1 = 2/3 for 1, and the match tolerance is at most one bit in any position, as in the previous example. Then the probability of a match is p = (p02 + p12)n + 20 (2 p0 p1) (p02 + p12)n-1 = (5/9)19(5/9 + 80/9) = 17 · (5/9)20, which means I = 20log2(9/5) - log2(17) » 12.87. That is, because of the biased bit probabilities, the 20 bits of the query contain less than 13 bits of useful information about the desired sequences.
D. The first bit of each string is 0 or 1 with equal probability, but each succeeding bit is equal to the previous one with probability p0 = 3/4, and it is different from it with probability p1 = 1/4, independently of all other bits. As above, similarity means at most one bit difference. Note that each bit, considered by itself, is 0 or 1 with equal probability; but we cannot apply the estimate of example B because the bits are not independent. To overcome this obstacle, we apply the following transformation to the strings: keep the first bit, and replace every subsequent bit by its absolute difference with the previous one. This transformation makes the bits independent of each other: the first bit is uniformly distributed, and the others are 0 or 1 with probabilities p0 and p1, respectively. Note that two original strings are similar if and only if the first k bits of their transforms are identical, and the remaining n-k bits are complementary, for some k in {0,…, n}. It follows that the match probability is

where r = (p02 + p12)/(2p0 p1) = 5/3. Therefore, p = (1/4)(3/8)19(1 + 3((5/3)20-1)/2) and I » 13.56.
Examples A-D (above) illustrate the difficulties we must deal when computing the match probability p for a DNA similarity search. As in example C, in a “random” DNA sequence (as drawn from a gene bank or produced by a sequencing machine) the bases {A, T, C, G} are known to occur with different probabilities. Moreover, because of the way that the genetic code works, we know that there is some dependency between adjacent bases, and between bases whose positions differ by a multiple of 3, especially within active genes. Long-range dependencies have also been suspected; for example, the frequencies of the four bases seem to vary along each chromosome in a fractal-like pattern, possibly as a consequence of gene duplication and/or physico-chemical constraints in the encoded proteins. For the same reasons, short- and long-range dependencies may exist between the mutations that distinguish the response to query s from the matching bank entries.
All these complications prevent us from computing p (or I) directly. To overcome these difficulties, we followed the approach used in example D; namely we transformed the sequences in such a way that its components become independent, and express the similarity criterion in terms of the transformed sequences.
The transformation we use is an encoding of the bases as complex numbers, followed by a Fourier transformation. Although the Fourier coefficients cannot be proven to be independent, this hypothesis is supported by both intuition and by statistical tests. An additional advantage of this approach is that it yields the amount of information contributed by each Fourier component of the sequence. This analysis is important for predicting the performance of the fast multiscale signal matching technique, proposed by Leitão and Stolfi (2002), which may be much faster than the matching algorithms that are currently in use.
DISCRETE FOURIER TRANSFORM AND POWER SPECTRUM
Let us consider a sequence x(n) represented in a time domain, where n assumes a range of N integer values. The discrete Fourier transform applied to this signal permits us to represent it in the frequency domain, as follows:

The spectral components of a power spectrum are defined as |X(k)|2, where the k values, ranging between 0 and N-1, are the frequencies that represent x(n).
INTERPRETING DNA AS SIGNALS
Before we can apply the tools of information theory to this problem, we must convert the DNA strings into signals - sequences of numerical elements. So, we opted for the compromise encoding used by Cheever et al. (1989), where each base is represented by a complex number. Specifically, A, T, C, and G are mapped to +1, -1, +i, -i, where I = Ö(-1) is the imaginary unit. Note that each purine base and the corresponding pyrimidine base are mapped to complementary values. This encoding has the advantage that it produces a complex signal, which is a well-studied object in signal processing theory, and it is the canonical input for the Fourier transform.
Power spectrum of DNA
Figure 1 shows the spectrum of two DNA sequences, encoded as described in the “Interpreting DNA as signals” section.

Note that the power spectrum of prokaryote DNA has a strong peak centered at frequency k = n/3, which corresponds to a spectral component with period 3. This peak is due to the asymmetry and non-uniform probability distribution of the codons (the three-base codes for amino acids). Anasstassiou (2002) investigated the use of this phenomenon to detect protein-coding regions in genomes. The peak spreads to the neighboring frequencies, presumably because occasional point deletions and insertions cause the period to fluctuate a little around 3.
The peak at k = n/3 is much weaker in the second spectrum, because in eukaryote DNA the coding regions are interrupted by numerous non-coding segments (introns). These are almost random sequences, which tend to have a flat power spectrum. Besides being diluted by the introns, the coding regions are also displaced by random amounts, not necessarily multiples of 3; these random shifts have the effect of broadening the peak.
Independence of the Fourier coefficients to the DNA signal
As explained in the “Useful information content” section, the main obstacle to computing the useful information contents of DNA for a similarity search is the definite but largely unknown dependence between nearby bases. We argue that the Fourier transform allows us to overcome that obstacle, because the real and imaginary parts of the Fourier coefficients are all independent random variables. The independence of the Fourier coefficients seems to be confirmed by experiments with DNA strings extracted from the GenBank database (see Figure 2).

This assertion cannot be proven mathematically, since one can construct distributions of DNA strings that violate it. However, it is supported by the observation that the real and imaginary parts of the Fourier basis functions are pairwise orthogonally. In fact, for every pair of distinct samples si and sj that contributes to distinct coefficients Sl and Sk, there is another pair si+r, sj+r that contributes to Sl with the same weights, and to Sk with opposite weights. Therefore, if the suspected correlation between si and sj depends only on the distance i-j (as should be the case for randomly clipped substrings of a genome), then these correlations cancel out and do not give rise to correlations between the Fourier coefficients.
INFORMATION THEORY BASICS
Entropy
Let X be a random variable that can assume values x1,…, xm with probabilities p1,…, pm. The entropy or expected information contents of X is

Conditional information
Suppose A and B are two random variables, with values {a1,…, ap} and {b1,…, bq}, respectively. When we know that A has a particular value ai, the information that we gain about B is:

where H(B|A = ai) is the result of applying Equation 3 to the conditional probability distribution Pr(B = bj|A = ai).
The average information carried by A about B is the expected value of I(B|A = ai) averaged over all ai, that is

where

Mutual information of a Gaussian variable
An important special case that matters to us is when A = S + N and B = S + P, where S, N and P are independent complex variables with symmetric Gaussian distributions and variances
,
, and
. We may think of S as a “message” from which one makes two independent copies, which get corrupted by “noises” N and P. Our goal is determine how much information A gives about B, on average. In this case, it turns out that A and B also have symmetric Gaussian distributions, with variances
=
+
and
=
+
, respectively. So the first term of Equation 5 is simply

Moreover, the conditional distribution Pr(S » x|A = y) turns out to be another Gaussian distribution, with mean y
/
and variance 
/
. Since P is independent of N and S, the conditional distribution of B = S + P, given A = y is also Gaussian, with the same mean y
/
and variance
/
/
+
. Note that this value does not depend on y; so Equation 6 reduces to

According to Equations 5, 7 and 8, the information given by A about B is then

The Shannon-Hartley formula
In particular, if we set N = 0 (that is, B = S) we get the Shannon-Hartley formula

which gives the amount of information carried by the corrupted message A about the original message S (Lathi, 1968).
Gaussian signals
The formula for mutual information is greatly simplified in the case of Gaussian signals, signals with Fourier coefficients that are random independent variables with symmetric Gaussian distributions on the complex plane. Many natural signals fit this model.
Let Ak, Bk, Sk,
k, and
k be the Fourier coefficients of a, b, s,
, and
, respectively. Let us assume that the coefficients Sk,
k, and
k are independent random variables with symmetric, zero-mean Gaussian distributions over the complex plane. Let us assume also that
k, and
k have the same variance
k. Then, the information given by each coefficient Ak about the corresponding coefficient Bk (Lathi, 1968) turns out to be

The total information about b carried by a is then simply

INFORMATION CONTENT OF DNA
We now apply this theory to estimate the useful information content of a DNA sequence for the purpose of finding homologs in a DNA bank. Note that two moderately long homologous DNA sequences are always descendants from a common ancestor.
We can view the DNA sequence abstractly as a signal corrupted by noise (differences between bases). Specifically, two homologous sequences can be written as a(t) = s(t) +
(t) and b(t) = s(t) +
(t), where s is the ancestral sequence, and
,
are “noise” functions that represent mutations or lost bases.
Determining
k and
k
Unfortunately, we have no direct information about the variance of the original signal
k (the genomic sequence of the mutual ancestor) or of the noise
k (the difference between the sequences caused by mutations or deletions). Let us then denote m as the average of the two signals a, b, and d, their difference. Then the Fourier coefficients Mk and Dk of signals m and d have variances


Thus, given a sample of homologous DNA chains, we can compute the variances
k and
k, and then estimate the variances
k and
k by the formulas

Therefore, by Equation 11, the amount of information contained in the frequency-k component of sequence a about the same component of its partner b is

Since
k =
k +
k
(Equation 15) could be rewritten as

RESULTS
To test this theory, we used 48 pairs of homologous sequences of prokaryote DNA taken from GenBank, selected to have at least 90% identity within each pair. Figure 3 shows the estimated variances
k, and
k and the useful information contents Ik for each component frequency k, as computed by Equation 16. The peak obtained at frequency N/3 was expected from the corresponding peak in the mean power spectrum
k. Note that, apart from this peak, the useful information is distributed almost uniformly over all frequencies.
By adding the information contents (Equation 16) of all coefficients, we obtain an estimate of 1.70 bits of useful information for homology matching. This number can be compared to the upper limit of 1.99 bits/base for the information contents of DNA, derived from Equation 3 under the assumption that each base was chosen independently, according to its observed frequency.
CONCLUSIONS
We have described a method for estimating the amount of information contained in a DNA sequence that can be used to identify homologous blocks, in spite of mutations and acquisition errors. This parameter (the mutual information content) allows us to estimate the probability of false positives - strings that are not homologous to the given sequence, but are just as similar to it as the homologous ones.
ACKNOWLEDGMENTS
Research supported partially by the Brazilian agencies CNPq and CAPES. We thank Gabriel Landini, pioneer in Fourier analysis of DNA, for motivating us to pursue this line of research.
REFERENCES
Anasstassiou, D. (2002). Digital Signal Processing of Biomolecular Sequences. Technical Report, CU/EE/TR2000-20-042. Department of Electrical Engineering, Columbia University, NY, USA.
Cheever, E.A., Karunaratne, W., Searls, D.B. and Overton, G.C. (1989). Using signal processing techniques for DNA sequence comparison. Proceedings of the Northeast Bioengineering Conference, Boston, MA, USA, pp. 173-174.
Lathi, B.P. (1968). Communication Systems. John Wiley & Sons, New York, USA.
Leitão, H.C.G. and Stolfi, J. (2002). A Multi-Scale Method for the Re-Assembly of Two-Dimensional Fragmented Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 9: 1239-1251.
Meidanis, J. and Setubal, J.C. (1997). Introduction to Computational Molecular Biology. PWS Publishing Company, Pacific Grove, CA, USA.