Viewing n-gram List of the Document (Menu Analysis \ Tools for Analysis)

An n-gram is a string of n consecutive characters; the characters need not be distinct. 2-grams are called bigrams or digrams, and 3-grams are called trigrams. 1-gram lists are called histograms.

The n-gram list of a document contains all n-grams of the document together with their frequencies, usually sorted in descending order of frequency. CrypTool limits n-gram lists to the 5000 most frequent n-grams.
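
The following Python sketch illustrates how such a list could be built. It is not CrypTool's implementation: the function name ngram_list is hypothetical, and computing the percentage as the count divided by the total number of n-grams in the text is an assumption. Only the limit of 5000 entries is taken from the description above.

```python
from collections import Counter

def ngram_list(text, n, limit=5000):
    """Return (n-gram, count, percent) rows, most frequent first.

    Assumption: percent = count / total number of n-grams in the text.
    """
    # Slide a window of width n over the text, one position at a time.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    # most_common() sorts by descending count and truncates to `limit`.
    return [(g, c, 100.0 * c / total) for g, c in counts.most_common(limit)]

rows = ngram_list("ABABAB", 2)
for gram, count, percent in rows:
    print(f"{gram}  {count}  {percent:.3f} %")
```

For the text "ABABAB" the bigram AB occurs three times and BA twice, so AB is listed first.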

If you analyse a text document, only characters from the current alphabet (see menu Options \ Text options) are considered. Characters that do not belong to the current alphabet "separate" the text: no n-gram extends across them. For example, if the space character does not belong to the current alphabet, then the text "ATTACK AT DAWN" has the trigrams ATT, TTA, TAC, ACK, DAW and AWN. The word AT yields no trigram because it is shorter than three characters.
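
The separating effect can be sketched in Python as follows. This is a minimal illustration, not CrypTool's code: the function name ngrams_in_alphabet is hypothetical, and the default alphabet of the 26 uppercase letters is an assumption standing in for whatever is configured in the text options.

```python
import string

def ngrams_in_alphabet(text, n, alphabet=string.ascii_uppercase):
    """Return the n-grams of text, in order of occurrence.

    Characters outside `alphabet` act as separators: they end the
    current run, so no n-gram spans them.
    """
    allowed = set(alphabet)
    grams = []
    run = ""  # current run of consecutive alphabet characters
    for ch in text:
        if ch in allowed:
            run += ch
            if len(run) >= n:
                grams.append(run[-n:])  # the n-gram ending at this character
        else:
            run = ""  # separator: start a new run
    return grams

print(ngrams_in_alphabet("ATTACK AT DAWN", 3))
# ['ATT', 'TTA', 'TAC', 'ACK', 'DAW', 'AWN']
```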

The n-gram analysis of binary files considers all 256 possible byte values, so no byte acts as a separator.
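
For binary data the same counting can be done on raw bytes without any alphabet filter. The sketch below is an illustration only; the function name byte_ngram_counts is hypothetical.

```python
from collections import Counter

def byte_ngram_counts(data: bytes, n: int) -> Counter:
    """Count every n-byte window in data; all 256 byte values are allowed."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

counts = byte_ngram_counts(b"\x00\x01\x00\x01\xff", 2)
for gram, count in counts.most_common():
    print(gram.hex(), count)
```

To analyse a file, its contents can be read in binary mode (for example with open(path, "rb").read()) and passed to the function as data.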

Example:

The 5 most frequent trigrams of the reference file genesis-en.txt are:

Trigram   Frequency
THE       5.445 %
AND       5.375 %
HER       1.275 %
HIS       1.000 %
HAT       0.892 %

The n-gram analysis can be computed in the n-gram list dialog.

This list can optionally be saved as a text file.