Corpus statistics

Frequency list

This option displays the frequency list of all lemmas in the corpus. By default, the list is limited to content words, but any part of speech can be included or excluded using the button at the top of the page. The “Download” button downloads the full frequency list, ignoring any applied filters.

Measures of dispersion and adjusted frequency

Apart from the frequency list, adjusted word frequencies and dispersion measures can also be consulted.

The available measures are listed below. Their formulas use the following notation:

  • \(N\) – the number of texts in the corpus.

  • \(N_t\) – the number of texts containing a given word.

  • \(L\) – the number of words in the corpus.

  • \(L_i\) – the number of words in the i-th text of the corpus.

  • \(F\) – the number of occurrences of the word in the whole corpus.

  • \(F_i\) – the number of occurrences of the word in the i-th text of the corpus.

  • \(\bar{F}\) – the average number of occurrences of the word in a text: \(\bar{F} = \frac{\sum_{i=1}^{N}F_i}{N}\).

  • \(P_i\) – the ratio of the number of occurrences of the word in the i-th text of the corpus and the total number of words in the i-th text of the corpus.

  • \(\bar{P}\) – the average ratio of the number of occurrences of the word to the total number of words in a corpus text: \(\bar{P} = \frac{\sum_{i=1}^{N}P_i}{N}\).

  • \(S_i\) – the ratio of the number of words in the i-th text of the corpus and the total number of words in the corpus: \(S_i = \frac{L_i}{L}\).
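
The notation above can be derived from two per-text lists: the occurrences of the word in each text and the size of each text. The snippet below is only an illustrative sketch, not the code used by Korpusomat; the function name `notation_values` and its inputs `freqs` and `sizes` are assumptions made for this example.

```python
def notation_values(freqs, sizes):
    """Compute the notation quantities for one word.

    freqs[i] is F_i (occurrences of the word in text i) and
    sizes[i] is L_i (number of words in text i). Both names are
    hypothetical inputs chosen for this sketch.
    """
    N = len(sizes)                             # number of texts
    N_t = sum(1 for f in freqs if f > 0)       # texts containing the word
    L = sum(sizes)                             # words in the corpus
    F = sum(freqs)                             # occurrences in the corpus
    F_bar = F / N                              # mean occurrences per text
    P = [f / l for f, l in zip(freqs, sizes)]  # relative frequency per text
    P_bar = sum(P) / N                         # mean relative frequency
    S = [l / L for l in sizes]                 # share of the corpus per text
    return {"N": N, "N_t": N_t, "L": L, "F": F,
            "F_bar": F_bar, "P": P, "P_bar": P_bar, "S": S}


if __name__ == "__main__":
    # Three texts of 100, 100 and 200 words; the word occurs 2, 0 and 6 times.
    print(notation_values([2, 0, 6], [100, 100, 200]))
```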

Measures of dispersion and adjusted frequency:

  • Frequency – the number of occurrences of a word in the corpus.

    \[F\]
  • IDF (inverse document frequency) – a measure of the amount of information carried by the use of a given word. IDF equals 0 if the word is present in all of the texts; the fewer texts contain the word, the higher the value.

    \[IDF = \log_2\frac{N}{N_t}\]
  • Variation coefficient (vc) – the standard deviation of the per-text frequencies of the word divided by their mean.

    \[\sigma_F = \sqrt{\frac{\sum_{i=1}^N(F_i - \bar{F})^2}{N}}\]
    \[vc = \frac{\sigma_F}{\bar{F}}\]
  • Juilland’s D and U – measures of dispersion (D) and adjusted frequency (U). The values are calculated using an adjusted formula, which takes differences in text sizes into account. The values of D fall within the range of \([0,1]\), whereas the values of U fall within the range of \([0,F]\).

    \[\sigma_P = \sqrt{\frac{\sum_{i=1}^N(P_i - \bar{P})^2}{N}}\]
    \[D_{adj} = 1 - \frac{\sigma_P}{\bar{P}\sqrt{N - 1}}\]
    \[U = D_{adj} \cdot F\]

    Note that the standard deviation used for Juilland’s D and U (\(\sigma_P\), based on the ratios \(P_i\)) is calculated with a different formula than the one used for the variation coefficient (\(\sigma_F\), based on the counts \(F_i\)).

  • Gries’ DP (reverted DP, normalized DP) – measures of dispersion. In addition to DP, the values of reverted DP (\(DP_{rev}\)), which can be easily compared to other measures, and normalized DP (\(DP_{norm}\)), which can be used to compare corpora consisting of different numbers of texts, are calculated. Higher values of DP and normalized DP (and lower values of reverted DP) represent a more uneven distribution of a given word. The values of DP fall within the range of \([0, 1-\min\left\{ S_i \right\}_{i=1}^{N}]\), the values of normalized DP fall within the range of \([0,1]\), and the values of reverted DP fall within the range of \([\min \left\{ S_i \right\}_{i=1}^N, 1]\).

    \[DP = \frac{\sum_{i=1}^N \left| \frac{F_i}{F} - S_i \right|}{2}\]
    \[DP_{rev} = 1 - DP\]
    \[DP_{norm} = \frac{DP}{1 - \frac{1}{N}}\]
  • Carroll’s D2 and Um – measures of dispersion (D2) and adjusted frequency (Um). D2 displays values from the \([0,1]\) range; the lower the values, the more uneven the distribution of the word. The values of Um fall within the range of \([\frac{F}{N},F]\).

    \[D_2 = \frac{- \sum_{i=1}^{N} \left( \frac{P_i}{\sum_{j=1}^{N} P_j} \cdot \log_2\frac{P_i}{\sum_{j=1}^{N}P_j} \right)}{\log_2{N}}\]
    \[U_m = F \cdot D_2 + (1 - D_2) \cdot \frac{F}{N}\]
  • KL Divergence (KLD) – a non-symmetric measure of the difference between probability distributions, used here as a dispersion measure. The measure takes non-negative values, where higher values represent a more uneven distribution of the word. In the following formula it is assumed that \(\log_2 0 = 0\), so texts in which the word does not occur contribute 0 to the sum.

    \[KLD = \sum_{i=1}^N \Bigg( \frac{F_i}{F} \cdot \log_2 \left( \frac{F_i}{F} \cdot \frac{1}{S_i} \right) \Bigg)\]
  • Rosengren’s S and AF – measures of dispersion (S) and adjusted frequency (AF). The values of the metrics are calculated using an adjusted formula, which takes differing text sizes into account. S displays values from the range \([\frac{1}{N}, 1]\), where higher values represent a more even distribution of the word in the corpus. The values of AF fall within the range of \([\frac{F}{N}, F]\).

    \[S_{adj} = \frac{1}{F} \left( \sum_{i=1}^N \sqrt{F_i \cdot S_i} \right)^2\]
    \[AF = F \cdot S_{adj}\]
  • ARF (average reduced frequency) – a measure of adjusted frequency based on the distances between the occurrences of a given word. In the formula below, the variable \(d_i\) stands for the distance between the i-th and the (i+1)-th occurrence of the word; for \(i=F\), it is the distance from the last occurrence back to the first one, with the corpus treated as cyclic so that the distance between the last and the first word of the corpus is 1. The values of ARF fall within the range of \([1,F]\) (higher values represent a more even distribution of the word).

    \[ARF = \frac{F}{L} \sum_{i=1}^F \min \left\{ d_i, \frac{L}{F} \right\}\]
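
The text-based measures in the list above (all except ARF) can be computed from the per-text counts \(F_i\) and text sizes \(L_i\) alone. The following is a minimal sketch that transcribes the formulas as they are written here; it is not Korpusomat's implementation. The function name `dispersion_measures` and the dictionary keys are assumptions made for this example, and the sketch assumes at least two texts and at least one occurrence of the word.

```python
import math


def dispersion_measures(freqs, sizes):
    """Transcribe the text-based formulas for one word.

    freqs[i] is F_i and sizes[i] is L_i (hypothetical input names).
    """
    N, L, F = len(sizes), sum(sizes), sum(freqs)
    if N < 2 or F == 0:
        raise ValueError("needs at least two texts and one occurrence")
    N_t = sum(1 for f in freqs if f > 0)
    F_bar = F / N
    P = [f / l for f, l in zip(freqs, sizes)]
    P_bar, P_sum = sum(P) / N, sum(P)
    S = [l / L for l in sizes]

    # IDF and variation coefficient (standard deviation of raw counts).
    idf = math.log2(N / N_t)
    sigma_F = math.sqrt(sum((f - F_bar) ** 2 for f in freqs) / N)
    vc = sigma_F / F_bar

    # Juilland's D and U (standard deviation of relative frequencies).
    sigma_P = math.sqrt(sum((p - P_bar) ** 2 for p in P) / N)
    D_adj = 1 - sigma_P / (P_bar * math.sqrt(N - 1))
    U = D_adj * F

    # Gries' DP with its reverted and normalized variants.
    DP = sum(abs(f / F - s) for f, s in zip(freqs, S)) / 2
    DP_rev = 1 - DP
    DP_norm = DP / (1 - 1 / N)

    # Carroll's D2 and Um; zero terms are skipped (log2 0 treated as 0).
    D2 = -sum((p / P_sum) * math.log2(p / P_sum) for p in P if p > 0)
    D2 /= math.log2(N)
    U_m = F * D2 + (1 - D2) * F / N

    # KL divergence; texts without the word contribute 0.
    KLD = sum((f / F) * math.log2((f / F) / s)
              for f, s in zip(freqs, S) if f > 0)

    # Rosengren's S and AF.
    S_adj = sum(math.sqrt(f * s) for f, s in zip(freqs, S)) ** 2 / F
    AF = F * S_adj

    return {"IDF": idf, "vc": vc, "D_adj": D_adj, "U": U,
            "DP": DP, "DP_rev": DP_rev, "DP_norm": DP_norm,
            "D2": D2, "U_m": U_m, "KLD": KLD, "S_adj": S_adj, "AF": AF}


if __name__ == "__main__":
    # A word spread proportionally to text size, then one confined to a single text.
    print(dispersion_measures([1, 1, 2], [100, 100, 200]))
    print(dispersion_measures([4, 0, 0, 0], [100, 100, 100, 100]))
```

In the first call the word is distributed exactly in proportion to text sizes, so the dispersion measures sit at their "even" ends; in the second call all occurrences fall in one of four equally sized texts, which pushes them to the "uneven" ends.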

For corpora consisting of only one text, only frequency and ARF are calculated.
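
ARF does not depend on the division into texts, only on where the word occurs in the corpus. The sketch below illustrates the formula under the assumption that occurrences are given as 0-based token positions and that the last distance wraps around from the last occurrence to the first one; the function name `arf` and its parameters are hypothetical and not part of Korpusomat.

```python
def arf(positions, corpus_len):
    """Average reduced frequency from 0-based token positions.

    positions holds the token indices of the word's occurrences and
    corpus_len is L (both names are assumptions of this sketch).
    """
    pos = sorted(positions)
    F = len(pos)
    if F == 0:
        return 0.0
    # Distances between consecutive occurrences.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    # Cyclic distance from the last occurrence back to the first one.
    gaps.append(corpus_len - pos[-1] + pos[0])
    window = corpus_len / F  # L / F
    return (F / corpus_len) * sum(min(d, window) for d in gaps)


if __name__ == "__main__":
    print(arf([0, 4, 8], 12))  # evenly spaced occurrences
    print(arf([0, 1, 2], 12))  # occurrences clustered together
```

With evenly spaced occurrences every distance equals \(\frac{L}{F}\), so ARF reaches \(F\); clustering shortens most distances and lowers the value towards 1.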

The measures of dispersion are described in more detail in “Dispersions and adjusted frequencies in corpora” (Gries, 2008) and in chapter 5 of the handbook “A Practical Handbook of Corpus Linguistics” (Gries, 2021).

Terminology

Terminology is generated through the TermoPL application, where you can also find its detailed description, instructions, and additional information.

The information generated by the Terminology functionality is limited to base forms, C-values, and numbers of occurrences, sorted by C-value. After clicking on the “Download” button, a txt file containing all the data generated by TermoPL is downloaded to the user’s computer.

The files generated by Korpusomat are compatible with the TermoPL application: after downloading the “corpus source files” (see the main corpus panel), you can run TermoPL on them with the options of your choice.