An n-gram is a string of n consecutive characters; the characters need not be distinct. 2-grams are called bigrams or digrams, and 3-grams are called trigrams. 1-gram lists are called histograms.
The n-gram list of a document contains all n-grams of the document together with their frequencies, usually sorted in descending order of frequency. CrypTool limits n-gram lists to the 5000 most frequent n-grams.
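As a rough illustration of such a list, the following Python sketch counts all n-grams of a string and returns them ordered by descending frequency. The function name ngram_list, the limit parameter and the alphabetical tie-breaking are assumptions made for this example; they are not taken from CrypTool's implementation.

```python
from collections import Counter


def ngram_list(text, n, limit=5000):
    """Return up to `limit` (n-gram, count) pairs, most frequent first."""
    # Slide a window of width n over the text and count every substring.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count; ties are broken alphabetically
    # (an assumption of this sketch, chosen only to get a stable order).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


print(ngram_list("ABABAB", 2))  # [('AB', 3), ('BA', 2)]
```

The limit defaults to 5000 only to mirror the cap described above.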
When you analyse a text document, only characters from the current alphabet (see menu Options \ Text options) are considered. Characters that do not belong to the current alphabet "separate" the text: no n-gram extends across such a character. For example, if the space character does not belong to the current alphabet, then the text "ATTACK AT DAWN" has the trigrams ATT, TTA, TAC, ACK, DAW and AWN.
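The separating effect can be sketched in Python as follows. This is a minimal illustration, assuming the alphabet consists of the uppercase letters A to Z; the function name ngrams_within_alphabet is hypothetical and does not come from CrypTool.

```python
import string


def ngrams_within_alphabet(text, n, alphabet):
    """Collect n-grams that lie entirely inside runs of alphabet characters."""
    allowed = set(alphabet)
    result = []
    run = ""  # current run of consecutive alphabet characters
    for ch in text:
        if ch in allowed:
            run += ch
            if len(run) >= n:
                result.append(run[-n:])  # n-gram ending at this character
        else:
            run = ""  # a character outside the alphabet separates the text
    return result


# Assumed alphabet: uppercase letters only, so the space is a separator.
print(ngrams_within_alphabet("ATTACK AT DAWN", 3, string.ascii_uppercase))
# ['ATT', 'TTA', 'TAC', 'ACK', 'DAW', 'AWN']
```

The word "AT" contributes no trigram because it is shorter than three characters and is bounded by separators on both sides.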
The n-gram analysis of binary files considers all 256 possible byte values.
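For binary data no alphabet filtering applies, so every byte takes part in the n-grams. The Python sketch below counts byte n-grams under that assumption; the function name byte_ngram_counts and the sample data are made up for illustration.

```python
from collections import Counter


def byte_ngram_counts(data, n):
    """Count every n-byte window in `data`; all 256 byte values are allowed."""
    # Slicing a bytes object yields bytes, which can be used as Counter keys.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


sample = b"\x00\x01\x00\x01\xff"  # made-up sample data
counts = byte_ngram_counts(sample, 2)
print(counts.most_common(1))  # [(b'\x00\x01', 2)]
```

For a real file, the data could be obtained by opening the file in binary mode ("rb") and reading it.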
Example:
[Table: the 5 most frequent trigrams of the reference file genesis-en.txt]
The n-gram analysis can be computed in the n-gram list dialog.
This list can optionally be saved as a text file.