Parallel corpora

Korpusomat allows you to create parallel corpora — collections in which texts in different languages are aligned at the sentence level. Each sentence in one language corresponds to a sentence (or group of sentences) in another language. This makes it possible to search for equivalents of words, phrases and constructions between languages.

Supported file formats

Texts can be added in three formats:

  • TMX - Already includes sentences paired in two (or more) languages. Korpusomat automatically reads these pairs and keeps them linked.

  • CSV - Each row in the CSV file corresponds to one pair (or group) of sentences in different languages.

Preparing a CSV file for a parallel corpus:

The first row must contain language codes (e.g., pl, en, de), which serve as the column headers. Each subsequent row is one sentence alignment, with one sentence (or group of sentences) per language. Example (for two languages: Polish and English):

pl,en
Niestety nadal nie jest dostępna szczepionka przeciwko wirusowemu zapaleniu wątroby typu C.,Vaccination against hepatitis C is not yet available.
Zakażenie wirusem HIV,HIV infection
Ludzki wirus upośledzenia odporności (HIV) powoduje jedną z najważniejszych chorób zakaźnych w Europie.,The human immunodeficiency virus (HIV) remains one of the most important communicable diseases in Europe.
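
Sentences often contain commas, which would otherwise be read as column separators. The following is a minimal Python sketch of one way to produce such a file with the standard csv module, which quotes fields containing commas. The function name write_parallel_csv and the second example sentence are illustrative only, and it is an assumption that Korpusomat accepts standard CSV quoting.

```python
import csv
import io


def write_parallel_csv(rows, languages, stream):
    """Write aligned sentences as CSV, with language codes as the header row."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(languages)
    for row in rows:
        # Every row needs exactly one cell per language column.
        if len(row) != len(languages):
            raise ValueError(f"expected {len(languages)} cells, got {len(row)}")
        writer.writerow(row)


rows = [
    ("Zakażenie wirusem HIV", "HIV infection"),
    ("Tak, oczywiście.", "Yes, of course."),  # made-up pair containing commas
]
buffer = io.StringIO()
write_parallel_csv(rows, ["pl", "en"], buffer)
print(buffer.getvalue())
```

To save to disk instead of an in-memory buffer, the same function can be given a file opened with open(path, "w", encoding="utf-8", newline=""); the UTF-8 encoding here is an assumption, not a documented requirement.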

  • TXT - You can upload two separate text files, one for each language. Two cases are possible:

  1. Files with corresponding lines: each line contains one sentence, and the lines in both files are paired.

  2. Files with non-corresponding lines: for example, a book written in Polish and its translation into English. The files have no links between sentences, so the sentences are matched by the Hunalign tool, which compares sentence lengths and uses bilingual dictionaries to find corresponding sentences in both languages.
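
Before treating two TXT files as line-aligned, it can help to confirm that they really pair up line by line. The sketch below is a hypothetical local check, not part of Korpusomat: it compares line counts and flags lines where only one side is empty. The function name check_line_pairing is an assumption made for illustration.

```python
def check_line_pairing(source_text, target_text):
    """Return a list of problems that suggest the two texts are not line-aligned."""
    source_lines = source_text.splitlines()
    target_lines = target_text.splitlines()
    problems = []
    if len(source_lines) != len(target_lines):
        problems.append(
            f"line counts differ: {len(source_lines)} vs {len(target_lines)}"
        )
    # Compare only the lines that exist on both sides.
    pairs = zip(source_lines, target_lines)
    for number, (source, target) in enumerate(pairs, start=1):
        if bool(source.strip()) != bool(target.strip()):
            problems.append(f"line {number}: one side is empty")
    return problems


polish = "Zakażenie wirusem HIV\nDruga linia"
english = "HIV infection"
print(check_line_pairing(polish, english))
```

In practice the two strings would be read from the files to be uploaded, for example with pathlib.Path(...).read_text(encoding="utf-8"). An empty result means the simple checks passed; it does not prove that the sentences are correct translations of each other.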

Instructions

When creating a corpus, select the Parallel corpus checkbox and specify the languages of the uploaded texts.

image1

Next, choose either a previously aligned file (ADD ALIGNED TEXT (TMX, CSV)) or separate text files (ADD TEXTS TO ALIGN).

image2 image3

If each line in one of the uploaded files corresponds to the same line in the other, select the checkbox Lines in both files correspond to each other.

image4 image5

Tips for best results

For the most accurate alignment, use TMX files or files where lines correspond directly.

Avoid very long paragraphs on a single line. Splitting text into sentences improves alignment accuracy.
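
As a rough illustration of the last tip, the Python sketch below rewrites a paragraph so that each sentence sits on its own line. It uses a deliberately naive rule (split after ., ! or ? followed by whitespace) and is not the segmentation Korpusomat or Hunalign uses. It will split wrongly after abbreviations such as "e.g." or "dr.", so the result should be reviewed by hand. The function name one_sentence_per_line is hypothetical.

```python
import re

# Naive boundary: whitespace that follows a sentence-final ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def one_sentence_per_line(paragraph):
    """Put each (naively detected) sentence of a paragraph on its own line."""
    sentences = (part.strip() for part in SENTENCE_BOUNDARY.split(paragraph))
    return "\n".join(sentence for sentence in sentences if sentence)


text = "HIV infection is serious. A vaccine is not yet available. Is that clear?"
print(one_sentence_per_line(text))
```

For real corpora, a dedicated sentence segmenter for the language in question is likely to give better results than this regular expression.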