Notes

1 https://www.newseye.eu/.

2 This estimate is based on the period 1771–1910. Pletschacher, Clausner and Antonacopoulos (2015) report a word accuracy of 67.5% for the Finnish part of the Europeana newspaper collection. This estimate is based on a selection of about 132 000 pages included in the Europeana data set.

3 About half of the collection is in Swedish, the second official language of Finland and, until about 1890, the main publication language of newspapers and journals. We have not evaluated the quality of the Swedish data as thoroughly as that of the Finnish data, but it appears to be worse.

4 https://www.abbyy.com/en-eu/finereader/.

5 “In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image’s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine’s accuracy, and how important any deviation from ground truth is in that instance.” https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/. Cf. also Märgner and El Abed (2014) and Carrasco (2014).

6 This version was produced by the subcontractor when the GT data was formed.

7 The original data has 500 640 words. Aligning the different OCR versions of the data in parallel has proven hard, so we use the results for the 471K words of data that have content in every OCR version.

8 https://sites.google.com/view/icdar2017-postcorrectionocr/dataset.

9 https://github.com/flammie/omorfi.

10 https://github.com/jiemakel/omorfi, Mäkelä (2016).

11 Variation between w and v is one of the main differences between 19th-century and modern Finnish spelling. W was used much more in the 19th century; in modern Finnish it occurs mostly in foreign names (e.g. Wagner).

12 Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965. See https://en.wikipedia.org/wiki/Levenshtein_distance.
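
The definition in note 12 can be illustrated with a minimal Python sketch of the standard dynamic-programming computation. This is not the implementation used in the study. The function name levenshtein is hypothetical, and the word pair wanha/vanha is chosen only to illustrate the w/v variation of note 11.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the empty prefix of a
    # and the first j characters of b (j insertions).
    previous = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current

    return previous[-1]


if __name__ == "__main__":
    # Illustrative pair: 19th-century spelling vs. modern spelling.
    print(levenshtein("wanha", "vanha"))
    # Classic textbook pair.
    print(levenshtein("kitten", "sitting"))
```

Here the pair wanha/vanha differs by a single substitution, so its distance is 1, well below the edit distance of 3 or more mentioned in note 13.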

13 “It is impossible to correct very noisy texts, where the nature of the noise is random and words are distorted by a large edit distance (say 3 or more).”

14 https://transkribus.eu/Transkribus/.