Calculating the autocorrelation of a document (Menu Analysis \ Tools for Analysis)

The autocorrelation of a document is an index of the similarity of different sections of the document. It is sometimes possible to work out the key length of an encrypted document from its autocorrelation.

Correlation

The statements about Correlation are valid here for a independently, uniformly distributed, binary random sequence:

The similarity between two sets of data is normally measured by their correlation. Correlation C between two sequences of length n is calculated from the number A of agreeing and the number D of non-agreeing sequence members according to the formula

C := (A-D) / n

According to this formula, the expected value of the correlation between two independent random sequences is 0. On the other hand, two sequences which agree relatively often will have a positive correlation value. In the extreme case where the two documents are identical A=n, D=0 and therefore C=1. Conversely, if there are fewer than n/2 agreements, the correlation will be negative. The degree of correlation between two sequences is therefore a measure of their mutual dependence.

Autocorrelation

The autocorrelation function C(t) measures the similarity of sequence (s[i])=s[1] s[2]... to sequence (s[i+t])=s[t] s[t+1].... which is shifted by t places.

A sequence of length n is examined, and the following parameters are defined:

A(t) := number of members of sequences (s[i]) and (s[i+t]) in the segment under consideration which agree.

D(t) := number of members of sequences (s[i]) and (s[i+t]) in the segment under consideration which do not agree.

The autocorrelation function C(t) = (A(t)-D(t)) / n.

In the case of finite sequences, sequence s is shifted (acyclically) by t positions, so that sequence (s[i+t]) has only n-t sequence members (where n is the length of sequence (s[i]). To calculate the autocorrelation, one now considers the sequences s[1] s[2]... s[n-t+1] and s[t] s[t+1]... s[n] and calculates their correlation.

The autocorrelation function is used among other applications to work out the cycle length for automatic analysis of the Vigenère encryption algorithm and encryption by Byte Addition and the Exclusive-OR. For further information, see Examples chapter.