<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article article-type="research-article" xml:lang="EN" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">LIBER</journal-id>
<journal-title-group>
<journal-title>LIBER QUARTERLY</journal-title>
</journal-title-group>
<issn pub-type="epub">2213-056X</issn>
<publisher>
<publisher-name>Uopen Journals</publisher-name>
<publisher-loc>Utrecht, The Netherlands</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">lq.10322</article-id>
<article-id pub-id-type="doi">10.18352/lq.10322</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Ground Truth OCR Sample Data of Finnish Historical Newspapers and Journals in Data Improvement Validation of a re-OCRing Process</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-2747-1382</contrib-id>
<name>
<surname>Kettunen</surname>
<given-names>Kimmo</given-names>
</name>
<email>kimmo.kettunen@helsinki.fi</email>
<xref ref-type="aff" rid="aff1"/>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0003-0471-314X</contrib-id>
<name>
<surname>Koistinen</surname>
<given-names>Mika</given-names>
</name>
<email>j.m.o.koistinen@gmail.com</email>
<xref ref-type="aff" rid="aff1"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kervinen</surname>
<given-names>Jukka</given-names>
</name>
<email>jukka.kervinen@helsinki.fi</email>
<xref ref-type="aff" rid="aff1"/>
</contrib>
<aff id="aff1">University of Helsinki, National Library of Finland, Finland</aff>
</contrib-group>
<pub-date pub-type="epub">
<month>2</month>
<year>2020</year>
</pub-date>
<volume>30</volume>
<fpage>xx</fpage>
<lpage>xx</lpage>
<permissions>
<copyright-statement>Copyright 2020, The copyright of this article remains with the author</copyright-statement>
<copyright-year>2020</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://www.liberquarterly.eu/article/10.18352/lq.10322"/>
<abstract>
<p>The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 16.51 million pages, mainly in Finnish and Swedish. Out of these, about 7.64 million pages are freely available on the website <ext-link ext-link-type="uri" xlink:href="https://digi.kansalliskirjasto.fi/etusivu">https://digi.kansalliskirjasto.fi/etusivu</ext-link>. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The last nine years, 1921&#x2013;1929, were opened in January 2018.</p>
<p>This paper briefly presents the ground truth Optical Character Recognition data of about 500,000 words that has been compiled at the NLF for the development of an improved OCR process for the Finnish collection. We discuss the compilation of the data in general and show the results of the new OCR process in comparison to the current OCR, using the ground truth data as an evaluation benchmark. We also show with real newspaper data covering 30 years and 109 million words that the re-OCRing process improves the quality of the OCRed data.</p>
</abstract>
<kwd-group>
<kwd>OCR quality</kwd>
<kwd>ground truth data</kwd>
<kwd>evaluation</kwd>
<kwd>measurement</kwd>
<kwd>Finnish historical newspapers</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1. Introduction</title>
<p>The National Library of Finland has digitized historical newspapers, journals and ephemera (small prints) published in Finland since the late 1990s. The digitized collection of the NLF is part of a globally expanding network of historical data produced by libraries, which offers researchers and lay persons insight into the past. In 2012 it was estimated that there were about 129 million pages and 24,000 titles of digitized newspapers available on the web in Europe alone (<xref ref-type="bibr" rid="r6">Dunning, 2012</xref>). A very conservative estimate of the worldwide number of titles is 45,000 (<xref ref-type="bibr" rid="r25">The State of the Art, 2015</xref>). The number of currently available titles is probably much higher, as national libraries have been working steadily on digitization in Europe, North America and the rest of the world.</p>
<p>Besides continuously producing and publishing the digitized raw data, the NLF has in recent years also been involved in research on and improvement of the digitized material. In September 2019 we completed a two-year European Regional Development Fund (ERDF) project. The NLF was also involved in the research consortium <italic>Computational History and the Transformation of Public Discourse in Finland, 1640&#x2013;1910</italic> (COMHIS), which was funded by the Academy of Finland (2016&#x2013;2019) and utilized the newspaper and journal data in its research on historical changes of publicity in Finland. We also participate in, and provide our data to, the EU project NewsEye<xref ref-type="fn" rid="fn1">1</xref> that started in May 2018.</p>
<p>One part of our data improvement effort has been the quality analysis of the Finnish data. From this we have learned that about 70&#x2013;75&#x0025; of the words in the data are probably correct and recognizable. In a collection of about 2.4 billion words<xref ref-type="fn" rid="fn2">2</xref> this means that 600&#x2013;800 million word tokens are wrong (<xref ref-type="bibr" rid="r10">Kettunen &#x0026; P&#x00E4;&#x00E4;kk&#x00F6;nen, 2016</xref>). This is a huge proportion of the words in the collection. The documents are shown to users as PDF files in the web presentation system, but the results of the optical character recognition can also be seen in the user interface. We also provide the raw textual data as such for research use. OCR errors in the digitized newspapers and journals may have several harmful effects for users of the data. One of the most important effects of poor OCR quality &#x2013; besides lower readability and comprehensibility &#x2013; is worse on-line searchability of the documents in the collections. General usefulness and linguistic post-processing are also harmed by OCR errors (<xref ref-type="bibr" rid="r8">J&#x00E4;rvelin, Keskustalo, Sormunen, Saastamoinen, &#x0026; Kettunen, 2016</xref>; <xref ref-type="bibr" rid="r17">Lopresti, 2009</xref>; <xref ref-type="bibr" rid="r27">Traub et al., 2016</xref>). Although users of the NLF collections have not complained much about the quality, its improvement is a natural first step in adding more value to the collection.<xref ref-type="fn" rid="fn3">3</xref></p>
<p>In order to fulfill this mission, we started to consider re-OCRing the data in 2015. The main reason for this was that the collection had been OCRed with a proprietary OCR engine, ABBYY FineReader (v.7 and v.8). Newer versions of the software exist, the latest being 15.0,<xref ref-type="fn" rid="fn4">4</xref> but the cost of the Fraktur OCR font makes re-OCRing the collection with ABBYY FineReader too expensive. We ended up using the open source OCR engine Tesseract v. 3.04.01 and started to train a Fraktur font for it. This process and its results are described in detail in <xref ref-type="bibr" rid="r14">Koistinen, Kettunen, and P&#x00E4;&#x00E4;kk&#x00F6;nen (2017)</xref>, <xref ref-type="bibr" rid="r15">Koistinen, Kettunen, and Kervinen (2018)</xref> and in <xref ref-type="bibr" rid="r12">Kettunen and Koistinen (2019)</xref>.</p>
<p>The rest of the paper is organized as follows: section 2 introduces the data in the ground truth collection, section 3 compares the results of the new OCR on the GT data with the results of the current/old OCR using different measures and types of analysis, and section 4 concludes the paper.</p>
</sec>
<sec id="s2">
<title>2. Data in the GT Collection</title>
<p>The main reason for setting up a re-OCRing procedure for a digitized text collection is usually the bad or mediocre data quality of the collection. To properly evaluate the results of re-OCRing, one needs to establish ground truth (GT) data<xref ref-type="fn" rid="fn5">5</xref> that can be used for comparing the old and the new OCRed data. For this purpose we manually chose a set of newspaper and journal pages in Fraktur font, originating from different publications and decades. Our budget for the creation of the GT was minimal: we were able to pay a subcontractor for the creation of the basic GT, but the budget was limited (about 4,000 &#x20AC;). This also limited the amount of data that could be used for the GT.</p>
<p>The final GT data consists of 479 pages of journals and newspapers from the time period 1836&#x2013;1918. Most of the data is from 1870 onwards, as the majority of the publications in the collection are from 1870&#x2013;1910 (<xref ref-type="bibr" rid="r10">Kettunen &#x0026; P&#x00E4;&#x00E4;kk&#x00F6;nen, 2016</xref>). When the pages were picked, only the year of publication, type of publication (journal/newspaper), font type and number of pages and characters were known about the data. In the final selection 56&#x0025; of the pages are from journals and 44&#x0025; from newspapers. The journal data has about 950 K characters, the newspaper data 3.06 M. <xref ref-type="fig" rid="fg001">Figures 1</xref> and <xref ref-type="fig" rid="fg002">2</xref> show the numbers of characters in the newspaper and journal GT data for different years.</p>
<fig id="fg001">
<label>Fig. 1.</label>
<caption><p>Number of characters in newspaper GT data for each year.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2020_30_Kettunen_fig1.jpg"/>
</fig>
<fig id="fg002">
<label>Fig. 2.</label>
<caption><p>Number of characters in journal GT data for each year.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2020_30_Kettunen_fig2.jpg"/>
</fig>
<p><xref ref-type="fig" rid="fg003">Figure 3</xref> shows an excerpt of the GT data. The information includes the type of the publication (AIK is a journal, SAN &#x2013; not shown in the figure &#x2013; is a newspaper), the year of publication, the ISSN of the publication and page information of the page image file. GT, Tesseract, Old (ABBYY FineReader 7/8) and FR11 (ABBYY FineReader 11<xref ref-type="fn" rid="fn6">6</xref>) are different OCR versions of the data.</p>
<fig id="fg003">
<label>Fig. 3.</label>
<caption><p>Example of parallel GT data.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2020_30_Kettunen_fig3.jpg"/>
</fig>
<p>The final ground truth text was corrected manually in two phases: the first correction was made by a subcontractor on the output of ABBYY FineReader 11, and the final correction was performed in house at the National Library of Finland. The resulting GT is not errorless, but it is the best reference available. The final data used for this paper has 471,903 parallel lines<xref ref-type="fn" rid="fn7">7</xref> of word or character data. The words in the GT have 3,290,852 characters without spaces, including punctuation, and 4,234,658 characters with spaces. The mean word length is 6.97 characters.</p>
<p>The size of the data seems relatively small in comparison with the overall size of the collection, which was 1,063,648 pages of Finnish newspapers and journals at the time of creation. Given our limited means, however, the size can be considered adequate for our purposes. It is far from the one per cent of the original data that <xref ref-type="bibr" rid="r26">Tanner, Mu&#x00F1;oz and Ros (2009)</xref> used for error rate counting with 19<sup>th</sup> century British newspapers, but it is also much larger than the evaluation data sets of typical OCR research papers. <xref ref-type="bibr" rid="r1">Berg-Kirkpatrick and Klein (2014)</xref> use 300&#x2013;600 lines of text, and <xref ref-type="bibr" rid="r5">Drobac, Kauppinen, and Lind&#x00E9;n (2017)</xref> 9,000&#x2013;27,000 lines of text, as evaluation data in their re-OCRing trials. <xref ref-type="bibr" rid="r24">Silfverberg, Kauppinen, and Lind&#x00E9;n (2016)</xref> use 40,000 word pairs in postcorrection evaluation and <xref ref-type="bibr" rid="r9">Kettunen (2016)</xref> uses 3,800&#x2013;12,000 word pairs. <xref ref-type="bibr" rid="r4">Dashti (2018)</xref> uses about 300,000 word tokens for the evaluation of a real-word error correction algorithm. The ICDAR Post-OCR Text Correction 2017 competition uses a dataset of more than 12 million characters of English and French.<xref ref-type="fn" rid="fn8">8</xref> In comparison to current usage in the field, our 471,903 words and 3,290,852 characters can be considered a medium-sized data set.</p>
</sec>
<sec id="s3">
<title>3. Comparison of New OCR to GT and Old OCR</title>
<p>We have described the components of the re-OCRing process and its evaluation thoroughly in Koistinen et al. (<xref ref-type="bibr" rid="r14">2017</xref>, <xref ref-type="bibr" rid="r15">2018</xref>) and <xref ref-type="bibr" rid="r12">Kettunen and Koistinen (2019)</xref>. Here we discuss only the evaluation results of the re-OCR process using the GT data.</p>
<p>Basic statistics of the data show that 85.4&#x0025; of the words in Tesseract&#x2019;s output are identical to the words of the ground truth. For the old OCR this figure is 73.1&#x0025; and for ABBYY FineReader v.11 it is 79&#x0025;.</p>
<p>We have performed different analyses of the data and found that the new Tesseract OCR is clearly better than the old ABBYY FineReader v.7/8 OCR in all respects. The Tesseract OCR is also better than the ABBYY FineReader v. 11 OCR on the same data (Koistinen et al., <xref ref-type="bibr" rid="r14">2017</xref>, <xref ref-type="bibr" rid="r15">2018</xref>). <xref ref-type="table" rid="tb001">Table 1</xref> shows the recognition results for the data with two automatic morphological analyzers, Omorfi<xref ref-type="fn" rid="fn9">9</xref> and a version of Omorfi<xref ref-type="fn" rid="fn10">10</xref> that has some enhanced capability to recognize 19<sup>th</sup> century Finnish. We call this version HisOmorfi. We have earlier used morphological analyzers to get an overall picture of the word-level correctness of the data in <xref ref-type="bibr" rid="r10">Kettunen and P&#x00E4;&#x00E4;kk&#x00F6;nen (2016)</xref> and <xref ref-type="bibr" rid="r11">Kettunen, P&#x00E4;&#x00E4;kk&#x00F6;nen, and Koistinen (2016)</xref> without available ground truth. Although the method is prone to estimation errors, it gives a good enough analysis of the data and is easy to use.</p>
<table-wrap id="tb001">
<label>Table 1.</label>
<caption>
<p>Recognition rates for different comparable data: 471,903 words.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"></th>
<th align="left" valign="top">Ground truth version</th>
<th align="left" valign="top">Tesseract OCR version</th>
<th align="left" valign="top">Current (old) OCR version</th>
<th align="left" valign="top">FR11 OCR version</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Omorfi 0.3</td>
<td align="left" valign="top">88,413 unrecognized words<break/>81.26&#x0025; recognition rate</td>
<td align="left" valign="top">102,507 unrecognized words<break/>78.27&#x0025; recognition rate</td>
<td align="left" valign="top">107,838 unrecognized words<break/>77.14&#x0025; recognition rate</td>
<td align="left" valign="top">69,461 unrecognized words<break/>85.28&#x0025; recognition rate</td>
</tr>
<tr>
<td align="left" valign="top">HisOmorfi</td>
<td align="left" valign="top">24,054 unrecognized words<break/>94.9&#x0025; recognition rate</td>
<td align="left" valign="top">47,747 unrecognized words<break/>89.88&#x0025; recognition rate</td>
<td align="left" valign="top">89,800 unrecognized words<break/>80.97&#x0025; recognition rate</td>
<td align="left" valign="top">65,984 unrecognized words<break/>86.01&#x0025; recognition rate</td>
</tr>
</tbody>
</table>
</table-wrap>
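<p>The recognition rates in Table 1 follow directly from the unrecognized-word counts and the total of 471,903 words; a minimal sketch of the arithmetic (small rounding differences against the table are possible):</p>

```python
# Recognition rate = share of the 471,903 words that the analyzer accepts.
TOTAL_WORDS = 471_903

def recognition_rate(unrecognized: int, total: int = TOTAL_WORDS) -> float:
    """Percentage of words the morphological analyzer recognizes."""
    return round(100.0 * (1 - unrecognized / total), 2)

# Omorfi 0.3 on the ground truth version (Table 1, first row).
print(recognition_rate(88_413))   # 81.26
# Omorfi 0.3 on the FR11 OCR version.
print(recognition_rate(69_461))   # 85.28
```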
<p>Plain Omorfi recognizes Tesseract&#x2019;s words slightly better than the words of the current OCR, the difference being 1.13&#x0025; units. The seemingly small difference is explained by the fact that HisOmorfi was used in the re-OCRing process to choose words from Tesseract&#x2019;s output, and it favors <italic>w</italic> over <italic>v</italic>;<xref ref-type="fn" rid="fn11">11</xref> thus more words with <italic>w</italic> than <italic>v</italic> are produced in the process. The old OCR words have 27,127 w&#x2019;s, the Tesseract OCR words 64,180, the GT 74,046 and FR11 only 3,732. Plain Omorfi does not recognize most of the words that include <italic>w</italic>, but HisOmorfi is able to recognize them, as shown by the high recognition percentages in the HisOmorfi result columns for Tesseract and the GT. The words OCRed with Tesseract achieve almost a 9&#x0025; unit improvement in recognition with HisOmorfi compared to the current OCR.</p>
<sec id="s3a">
<title>3.1. Precision, Recall and F-score</title>
<p>The GT data also allows the use of other evaluation measures. We can use, for example, the standard measures of recall and precision and their combination, the F-score (<xref ref-type="bibr" rid="r19">Manning &#x0026; Sch&#x00FC;tze, 1999</xref>, pp. 267&#x2013;270; <xref ref-type="bibr" rid="r20">M&#x00E4;rgner &#x0026; El Abed, 2014</xref>), to get an overall picture of the results. These measures, which originate from information retrieval evaluation, have been used in both postcorrection and re-OCRing evaluations. Other similar measures exist, too, but many of them, such as the correction rate (CR) used in Silfverberg et al. (2016), are closely related to P/R scores and based on the same basic ideas. Recall and precision are useful also in the sense that they allow a more detailed analysis of the results.</p>
<p>The re-OCRed data consists of four different types of words: 1) true positives (TP) are originally wrongly OCRed words that are corrected in the re-OCRing; 2) false positives (FP) are correct words that are changed to a misspelling in the re-OCRing; 3) false negatives (FN) are wrongly spelled words that are still wrong after the re-OCRing; 4) true negatives (TN) are correct words that are correct after the re-OCRing.</p>
<p>Out of these we define Recall, R, as <italic>TP / (TP&#x002B;FN)</italic>, Precision, P, as <italic>TP / (TP&#x002B;FP)</italic> and F-score, F, as <italic>2*R*P / (R &#x002B; P)</italic> (<xref ref-type="bibr" rid="r19">Manning &#x0026; Sch&#x00FC;tze, 1999</xref>, pp. 268&#x2013;269). Correction rate, a more recent, slightly modified metric used in Silfverberg et al. (2016), is defined as <italic>(TP-FP) / (TP&#x002B;FN)</italic>.</p>
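<p>The four measures can be computed directly from the TP, FP and FN counts; a minimal sketch in Python, using as a check the counts reported in section 3.1 for the uncleaned parallel data:</p>

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN)."""
    return tp / (tp + fn)

def f_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def correction_rate(tp: int, fp: int, fn: int) -> float:
    """(TP - FP) / (TP + FN), as in Silfverberg et al. (2016)."""
    return (tp - fp) / (tp + fn)

# Counts for the uncleaned parallel data (section 3.1):
TP, FP, FN = 90_877, 32_953, 35_881
print(round(recall(TP, FN), 2))              # 0.72
print(round(precision(TP, FP), 2))           # 0.73
print(round(f_score(TP, FP, FN), 2))         # 0.73
print(round(correction_rate(TP, FP, FN), 2)) # 0.46
```

These rounded values reproduce the left column of Table 2.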
<p><xref ref-type="table" rid="tb002">Table 2</xref> shows the P/R results and F-scores of the re-OCRed data as well as its correction rate. We show two results: the left column compares the data without cleaning, while the right column shows the results with punctuation and all other non-alphabetic, non-numeric characters removed from the lines. The removed character set is: .,;\&#x2019;:\&#x201D;\&#x2019;_!@&#x0023;&#x0025;&#x0026;*()&#x002B;&#x003D;&#x003C;&#x003E;[]{}?\\/&#x2014;&#x02DC;&#x007C;&#x02C6;\&#x201C;&#x201E;&#x00A6;&#x00AB;&#x00A9;&#x00BB;&#x00AE;&#x00B0;&#x00A1;.</p>
<table-wrap id="tb002">
<label>Table 2.</label>
<caption>
<p>P/R results of re-OCRing in comparison to old OCR.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Basic results with parallel columns</th>
<th align="left" valign="top">Results without non-alphabet data</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Recall &#x003D; 0.72<break/>Precision &#x003D; 0.73<break/>F-score &#x003D; 0.73<break/>Correction rate &#x003D; 0.46</td>
<td align="left" valign="top">Recall &#x003D; 0.74<break/>Precision &#x003D; 0.77<break/>F-score &#x003D; 0.75<break/>Correction rate &#x003D; 0.51</td>
</tr>
</tbody>
</table>
</table-wrap>
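<p>The cleaning step for the right column of Table 2 can be implemented, for example, with Python&#x2019;s <italic>str.translate</italic>; a minimal sketch using the character set listed above (the exact handling in our evaluation pipeline may differ in detail):</p>

```python
# Characters stripped before the "cleaned" comparison (right column of Table 2).
REMOVE = ".,;':\"_!@#%&*()+=<>[]{}?\\/—˜|ˆ“„¦«©»®°¡’”"
TABLE = str.maketrans("", "", REMOVE)

def clean(token: str) -> str:
    """Remove punctuation and other non-alphanumeric characters from a token."""
    return token.translate(TABLE)

print(clean("wälillä.»"))   # wälillä
```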
<p>From now on we concentrate on a more detailed analysis of the left column results. The number of erroneous words in the data is 126,758 (and the number of errorless words thus 345,145). Re-OCRing corrects 90,877 of the errors (true positives, 71.7&#x0025; of errors) and leaves 35,881 uncorrected (false negatives, 28.3&#x0025; of errors). It also introduces 32,953 new errors into the data (false positives). Thus it seems that, in general, the recall of the re-OCRed data with regard to erroneous words is satisfactory, but precision is low, as the process produces quite a lot of new errors. This harms the overall result.</p>
<p>In comparison, a simple Levenshtein distance based postcorrection algorithm used in <xref ref-type="bibr" rid="r9">Kettunen (2016)</xref> for small data samples of 3,850&#x2013;12,000 word pairs usually had a high precision of 0.85&#x2013;0.95, but a much lower recall than our re-OCRing process. With the current data set the postcorrection algorithm achieves a recall of 0.47, a precision of 0.42 and an F-score of 0.44. If non-alphabetic data is pruned from the data, the F-score is 0.57. The postcorrection algorithm handles only lower case characters, which affects its results. If case distinction is omitted and non-alphabetic data is pruned, the postcorrection algorithm&#x2019;s best F-score is 0.63.</p>
<sec id="s3a1">
<title>3.1.1. False and True Positives</title>
<p>Recall and precision figures give an overall picture of the improvements achieved in the re-OCRing process. In order to get a more detailed view of the process, one needs to examine the sets of false and true positives more closely: what are the most frequent errors, what kinds of errors are corrected, and what new errors are generated. In our case part of the false positives in the re-OCRed data is due to recurring trouble with quote marks or with words divided over two lines with a hyphen. When re-OCRed, these data miss a quote or two in the result word, or contain the HTML code <italic>&#x0026;quote;</italic> instead of the quote itself. Many words are also incorrectly divided on the line. The same applies to the false negatives, too. The number of faulty word divisions among the false and true positives together is about 10,000, which makes this error type one of the most common. Missing or extra punctuation also causes errors. This can be seen in the right column of <xref ref-type="table" rid="tb002">Table 2</xref>, where the results with cleaned output are shown.</p>
<p>When the true positives are examined, one can see that about 54&#x0025; of the corrected errors are one-character corrections and about 89&#x0025; are 1&#x2013;3 character corrections. But re-OCR also corrects truly hard errors, where more than three characters are corrected. Even errors with a Levenshtein distance (<xref ref-type="bibr" rid="r16">Levenshtein, 1966</xref>)<xref ref-type="fn" rid="fn12">12</xref> (LD) over 10 are corrected; a few examples, word pairs with an edit distance of 11, are shown in <xref ref-type="table" rid="tb003">Table 3</xref>.</p>
<table-wrap id="tb003">
<label>Table 3.</label>
<caption>
<p>Corrections of Levenshtein distance of 11.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Original OCR</th>
<th align="left" valign="top">Tesseract 3.04.01</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">eiifuroauffellt&#x00BB;</td>
<td align="left" valign="top">esikuwauksellisesti</td>
</tr>
<tr>
<td align="left" valign="top">KarjlltijoloSluSyhbiStytsen</td>
<td align="left" valign="top">Karjanjalostusyhdistyksen</td>
</tr>
<tr>
<td align="left" valign="top">ttfcnf&#x00E4;Mt&#x00E4;mifeSf&#x00E4;,</td>
<td align="left" valign="top">itsens&#x00E4;kielt&#x00E4;misess&#x00E4;,</td>
</tr>
<tr>
<td align="left" valign="top">liiannfiljtccvillc</td>
<td align="left" valign="top">maansihteerille</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Another example of corrected hard errors is the set of 2,376 words with a Levenshtein edit distance of five. With an error count this high, words become unintelligible. Some examples of corrections with five errors are shown in <xref ref-type="table" rid="tb004">Table 4</xref>.</p>
<table-wrap id="tb004">
<label>Table 4.</label>
<caption>
<p>Corrections of Levenshtein distance of 5.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Original OCR</th>
<th align="left" valign="top">Tesseract 3.04.01</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">fofoufsessct,</td>
<td align="left" valign="top">kokouksessa</td>
</tr>
<tr>
<td align="left" valign="top">silmciyfsert</td>
<td align="left" valign="top">silm&#x00E4;yksen</td>
</tr>
<tr>
<td align="left" valign="top">ncihbessci&#x00E4;n</td>
<td align="left" valign="top">n&#x00E4;hdess&#x00E4;&#x00E4;n</td>
</tr>
<tr>
<td align="left" valign="top">ro&#x00E4;liH&#x00E4;</td>
<td align="left" valign="top">w&#x00E4;lill&#x00E4;.</td>
</tr>
<tr>
<td align="left" valign="top">yfsincicin.</td>
<td align="left" valign="top">yksin&#x00E4;&#x00E4;n</td>
</tr>
<tr>
<td align="left" valign="top">tylyybestcicin</td>
<td align="left" valign="top">tylyydest&#x00E4;&#x00E4;n</td>
</tr>
<tr>
<td align="left" valign="top">fitsattbestaan,</td>
<td align="left" valign="top">kitsaudestaan.</td>
</tr>
<tr>
<td align="left" valign="top">Iyw&#x00E4;zlyllln</td>
<td align="left" valign="top">Jyw&#x00E4;skyl&#x00E4;n</td>
</tr>
<tr>
<td align="left" valign="top">pairoana</td>
<td align="left" valign="top">p&#x00E4;iw&#x00E4;n&#x00E4;</td>
</tr>
</tbody>
</table>
</table-wrap>
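<p>The edit distances discussed here can be reproduced with the standard dynamic-programming algorithm for Levenshtein distance; a minimal sketch, checked against one pair from Table 4:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# One of the LD 5 corrections from Table 4:
print(levenshtein("pairoana", "päiwänä"))   # 5
```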
<p>The higher the error count, the harder the error is to correct for postcorrection software, and here lies the strength of re-OCRing at its best. Reynaert (2016), for example, states that his postcorrection system for Dutch, TICCL, best corrects errors of LD 1&#x2013;2. It can be run with LD 3, &#x201C;but this has a high processing cost and most probably results in lower precision.&#x201D; He considers error correction for LD 4 and higher values too ambitious for the time being. This is also one of the conclusions of <xref ref-type="bibr" rid="r3">Choudhury, Thomas, Mukherjee, Basu, and Ganguly (2007)</xref>.<xref ref-type="fn" rid="fn13">13</xref></p>
<p>The number of corrected words with edit distances of 1&#x2013;10 in true positives of our re-OCR process can be seen in <xref ref-type="table" rid="tb005">Table 5</xref>.</p>
<table-wrap id="tb005">
<label>Table 5.</label>
<caption>
<p>Number of corrected words with edit distances of 1&#x2013;10: 99.2&#x0025; of all the true positives.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Edit distance</th>
<th align="left" valign="top">Number of corrections</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">LD 1</td>
<td align="left" valign="top">47,783</td>
</tr>
<tr>
<td align="left" valign="top">LD 2</td>
<td align="left" valign="top">22,713</td>
</tr>
<tr>
<td align="left" valign="top">LD 3</td>
<td align="left" valign="top">9,182</td>
</tr>
<tr>
<td align="left" valign="top">LD 4</td>
<td align="left" valign="top">4,375</td>
</tr>
<tr>
<td align="left" valign="top">LD 5</td>
<td align="left" valign="top">2,376</td>
</tr>
<tr>
<td align="left" valign="top">LD 6</td>
<td align="left" valign="top">1,519</td>
</tr>
<tr>
<td align="left" valign="top">LD 7</td>
<td align="left" valign="top">920</td>
</tr>
<tr>
<td align="left" valign="top">LD 8</td>
<td align="left" valign="top">629</td>
</tr>
<tr>
<td align="left" valign="top">LD 9</td>
<td align="left" valign="top">423</td>
</tr>
<tr>
<td align="left" valign="top">LD 10</td>
<td align="left" valign="top">315</td>
</tr>
<tr>
<td align="left" valign="top"></td>
<td align="left" valign="top">SUM &#x003D; 90,235 (of 90,877 total true positives)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s3b">
<title>3.2. Further Analysis of Results</title>
<p>Overall, the total number of character errors in the data decreased from the old OCR&#x2019;s 293,364 to 220,254 in the Tesseract OCR, a decrease of about 25&#x0025;. Tesseract produces significantly more errorless words than the old OCR (403,069 vs. 345,145), but it also produces more character errors per erroneous word: the old OCR has about 2.32 errors per erroneous word, the Tesseract OCR 3.2. This is a mixed blessing: erroneous words are encountered less often in Tesseract&#x2019;s output, but they may be harder to read and understand when they occur.</p>
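<p>The figures above can be cross-checked with simple arithmetic from the counts given in this paper:</p>

```python
TOTAL_WORDS = 471_903

old_char_errors, new_char_errors = 293_364, 220_254
old_errorless, new_errorless = 345_145, 403_069

# About a 25% decrease in the total number of character errors.
decrease = (old_char_errors - new_char_errors) / old_char_errors
print(round(100 * decrease))   # 25

# Character errors per erroneous word in the Tesseract output.
new_erroneous = TOTAL_WORDS - new_errorless
print(round(new_char_errors / new_erroneous, 1))   # 3.2
```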
<p>The mean length of the word tokens &#x2013; including punctuation &#x2013; does not vary much between the different OCR versions: in the current OCR it is 6.94 characters, in the GT 6.97 and in the Tesseract OCR 6.99 characters. Word length does not greatly affect the improvement of the OCR: words in the current OCR that are up to seven characters long (286,066 in total) get an F-score of 0.72 and a correction rate of 0.44, while words longer than seven characters (185,387 in total) get an F-score of 0.73 and a correction rate of 0.47.</p>
<p>A frequency analysis of the characters in the different OCR versions does not show significant differences in alphabetical characters between the GT and Tesseract. Tesseract seems to produce too many zeros and ones out of numbers, and among the other characters the dash and the backslash are overgenerated.</p>
<p>The number of different word types (unique words) in the current OCR data is 176,625; in the GT data it is 135,433 and in the Tesseract OCR data 156,459. The number of hapax legomena, that is, words occurring only once, is 97,330 in the GT, 120,878 in the Tesseract OCR, and 140,802 in the current OCR. A larger number of unique words is one clear sign of more errors in the word data (<xref ref-type="bibr" rid="r7">Ghosh, Chakrabortya, Parui, &#x0026; Majumder, 2016</xref>).</p>
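<p>Type, token and hapax legomenon counts of this kind are straightforward to extract with Python&#x2019;s <italic>collections.Counter</italic>; a minimal sketch on a toy token list (illustration only, not the actual collection data):</p>

```python
from collections import Counter

def type_statistics(tokens):
    """Return (number of word types, number of hapax legomena)."""
    freq = Counter(tokens)
    types = len(freq)
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return types, hapaxes

# Toy example only; the counts above come from the full 471,903-word data.
tokens = ["sana", "sana", "wälillä", "päiwänä"]
print(type_statistics(tokens))   # (3, 2)
```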
</sec>
<sec id="s3c">
<title>3.3. Combined OCR Results</title>
<p>Using the combined results of several OCR engines has proven fruitful in many evaluations (e.g. <xref ref-type="bibr" rid="r13">Klein &#x0026; Kopel, 2002</xref>; <xref ref-type="bibr" rid="r28">Volk, Furrer, &#x0026; Sennrich, 2011</xref>). As our GT data also includes the results of another OCR engine, ABBYY FineReader v.11, we can evaluate the combined optimal results of Tesseract and ABBYY FineReader v.11 as well. The recall of the optimal result of the two combined OCR engines is 0.81, the precision 0.95, the F-score 0.88 and the correction rate 0.77, as shown in <xref ref-type="table" rid="tb006">Table 6</xref> in comparison to Tesseract&#x2019;s results alone. Unfortunately the other OCR engine is not available to us for the final re-OCRing; therefore we can only show the upper limits of the results with these two engines.</p>
<table-wrap id="tb006">
<label>Table 6.</label>
<caption>
<p>Combined P/R and correction rate results of Tesseract and ABBYY FineReader 11.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left" valign="top">Basic results: Tesseract only</td>
<td align="left" valign="top">Results with Tesseract &#x002B; FR11</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Recall &#x003D; 0.72<break/>Precision &#x003D; 0.73<break/>F measure &#x003D; 0.73<break/>Correction rate &#x003D; 0.46</td>
<td align="left" valign="top">Recall &#x003D; 0.81<break/>Precision &#x003D; 0.95<break/>F measure &#x003D; 0.88<break/>Correction rate &#x003D; 0.77</td>
</tr>
</tbody>
</table>
</table-wrap>
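<p>The combined figures in Table 6 are an upper bound: for each word, an oracle keeps whichever engine&#x2019;s output matches the ground truth. A sketch of such an oracle combination (the variable names are ours, not from the actual pipeline):</p>

```python
def oracle_combine(gt, eng_a, eng_b):
    """Per word, keep whichever engine's output matches the ground truth;
    fall back to engine A when neither does. This yields the upper bound
    that a perfect voting scheme between the two engines could reach."""
    return [a if a == g else (b if b == g else a)
            for g, a, b in zip(gt, eng_a, eng_b)]

gt         = ["talvi", "sota", "vuosi"]
tesseract  = ["talvi", "s0ta", "vuosl"]
finereader = ["ta1vi", "sota", "vuosl"]
combined = oracle_combine(gt, tesseract, finereader)
```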
</sec>
<sec id="s3d">
<title>3.4. Upper and Lower Case Characters</title>
<p>The distinction between upper and lower case is basic in Latin-alphabet writing systems, and OCRing should preserve it. We analyzed the effect of word-initial capitalization on the results: when capitalization is neutralized in the data, the results are almost the same. Thus the re-OCRing process appears to recognize upper and lower case letters well.</p>
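<p>The capitalization check can be reproduced by computing word accuracy twice, once case-sensitively and once with case folded on both sides; a small illustrative sketch:</p>

```python
def word_accuracy(gt, ocr):
    """Share of OCR tokens that exactly match the aligned GT tokens."""
    return sum(o == g for g, o in zip(gt, ocr)) / len(gt)

gt  = ["Helsinki", "Suomi", "kaupunki"]
ocr = ["helsinki", "Suomi", "kaupunki"]

raw    = word_accuracy(gt, ocr)                       # case-sensitive
folded = word_accuracy([w.lower() for w in gt],
                       [w.lower() for w in ocr])      # case neutralized
# A folded score well above the raw one would signal casing errors;
# near-identical scores mean casing is handled well.
```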
<p><xref ref-type="fig" rid="fg004">Figures 4</xref> and <xref ref-type="fig" rid="fg005">5</xref> show the distribution of upper and lower case letters of the basic Finnish alphabet in the ground truth and Tesseract data. Rare characters such as <italic>&#x00FC;, &#x00E1;</italic> etc., which occur only in foreign words, are left out of the figures.</p>
<fig id="fg004">
<label>Fig. 4.</label>
<caption><p>Upper case letters of ground truth and Tesseract.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2020_30_Kettunen_fig4.jpg"/>
</fig>
<fig id="fg005">
<label>Fig. 5.</label>
<caption><p>Lower case letters of ground truth and Tesseract.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2020_30_Kettunen_fig5.jpg"/>
</fig>
<p>As the figures show, there are no dramatic drops or spikes in the recognition of any letter. The OCRing process thus seems to handle the main characters of the Finnish alphabet quite consistently.</p>
</sec>
<sec id="s3e">
<title>3.5. Stepping Outside of the Sandbox</title>
<p>Use of a GT collection is of vital importance in OCR improvement, but it can have some drawbacks. Firstly, the collection may not be as representative as it should be. Secondly, using the GT collection during both development and evaluation may lead to overfitting to the data. To counter these possible effects, we also demonstrate quality improvement outside the GT data. After the initial development and evaluation of the re-OCRing process with the GT data, we started final testing of the re-OCRing with newspaper data. For this testing we chose <italic>Uusi Suometar</italic>, a newspaper that appeared in 1869&#x2013;1918 and comprises 86,068 pages. <xref ref-type="table" rid="tb007">Table 7</xref> shows the results of re-OCRing 30 years of this newspaper. Word-level recognition rates obtained with a morphological analyzer are given for the old and the new OCR.</p>
<table-wrap id="tb007">
<label>Table 7.</label>
<caption>
<p>Recognition rates of current and new OCR words of Uusi Suometar with morphological analyzer HisOmorfi.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left" valign="top">Year</td>
<td align="left" valign="top">Words</td>
<td align="left" valign="top">Current OCR</td>
<td align="left" valign="top">Re-OCR</td>
<td align="left" valign="top">Gain in &#x0025; units</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1869</td>
<td valign="top" align="left">658,685</td>
<td valign="top" align="left">69.42&#x0025;</td>
<td valign="top" align="left">86.59&#x0025;</td>
<td valign="top" align="left">17.18&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1870</td>
<td valign="top" align="left">655,772</td>
<td valign="top" align="left">66.98&#x0025;</td>
<td valign="top" align="left">85.54&#x0025;</td>
<td valign="top" align="left">18.57&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1871</td>
<td valign="top" align="left">910,707</td>
<td valign="top" align="left">72.81&#x0025;</td>
<td valign="top" align="left">87.27&#x0025;</td>
<td valign="top" align="left">14.46&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1872</td>
<td valign="top" align="left">930,493</td>
<td valign="top" align="left">75.09&#x0025;</td>
<td valign="top" align="left">88.25&#x0025;</td>
<td valign="top" align="left">13.16&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1873</td>
<td valign="top" align="left">892,745</td>
<td valign="top" align="left">74.61&#x0025;</td>
<td valign="top" align="left">87.04&#x0025;</td>
<td valign="top" align="left">12.43&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1874</td>
<td valign="top" align="left">921,603</td>
<td valign="top" align="left">72.70&#x0025;</td>
<td valign="top" align="left">86.09&#x0025;</td>
<td valign="top" align="left">13.39&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1875</td>
<td valign="top" align="left">1,075,339</td>
<td valign="top" align="left">70.62&#x0025;</td>
<td valign="top" align="left">85.52&#x0025;</td>
<td valign="top" align="left">14.90&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1876</td>
<td valign="top" align="left">1,223,455</td>
<td valign="top" align="left">71.50&#x0025;</td>
<td valign="top" align="left">85.51&#x0025;</td>
<td valign="top" align="left">14.01&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1877</td>
<td valign="top" align="left">1,818,803</td>
<td valign="top" align="left">72.09&#x0025;</td>
<td valign="top" align="left">84.79&#x0025;</td>
<td valign="top" align="left">12.70&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1878</td>
<td valign="top" align="left">2,193,869</td>
<td valign="top" align="left">70.78&#x0025;</td>
<td valign="top" align="left">84.70&#x0025;</td>
<td valign="top" align="left">13.91&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1879</td>
<td valign="top" align="left">2,238,412</td>
<td valign="top" align="left">73.52&#x0025;</td>
<td valign="top" align="left">86.09&#x0025;</td>
<td valign="top" align="left">12.57&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1880</td>
<td valign="top" align="left">2,135,334</td>
<td valign="top" align="left">70.11&#x0025;</td>
<td valign="top" align="left">85.85&#x0025;</td>
<td valign="top" align="left">15.74&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1881</td>
<td valign="top" align="left">2,617,533</td>
<td valign="top" align="left">67.98&#x0025;</td>
<td valign="top" align="left">84.26&#x0025;</td>
<td valign="top" align="left">16.28&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1882</td>
<td valign="top" align="left">2,736,109</td>
<td valign="top" align="left">62.41&#x0025;</td>
<td valign="top" align="left">82.94&#x0025;</td>
<td valign="top" align="left">20.53&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1883</td>
<td valign="top" align="left">3,182,853</td>
<td valign="top" align="left">70.19&#x0025;</td>
<td valign="top" align="left">82.17&#x0025;</td>
<td valign="top" align="left">11.98&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1884</td>
<td valign="top" align="left">3,365,356</td>
<td valign="top" align="left">69.60&#x0025;</td>
<td valign="top" align="left">81.67&#x0025;</td>
<td valign="top" align="left">12.07&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1885</td>
<td valign="top" align="left">3,965,632</td>
<td valign="top" align="left">68.11&#x0025;</td>
<td valign="top" align="left">82.53&#x0025;</td>
<td valign="top" align="left">14.42&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1886</td>
<td valign="top" align="left">4,247,173</td>
<td valign="top" align="left">68.21&#x0025;</td>
<td valign="top" align="left">82.12&#x0025;</td>
<td valign="top" align="left">13.92&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1887</td>
<td valign="top" align="left">4,393,615</td>
<td valign="top" align="left">65.25&#x0025;</td>
<td valign="top" align="left">82.16&#x0025;</td>
<td valign="top" align="left">16.91&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1888</td>
<td valign="top" align="left">5,030,160</td>
<td valign="top" align="left">70.27&#x0025;</td>
<td valign="top" align="left">82.52&#x0025;</td>
<td valign="top" align="left">12.25&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1889</td>
<td valign="top" align="left">5,152,628</td>
<td valign="top" align="left">65.71&#x0025;</td>
<td valign="top" align="left">81.41&#x0025;</td>
<td valign="top" align="left">15.70&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1890</td>
<td valign="top" align="left">5,676,613</td>
<td valign="top" align="left">64.69&#x0025;</td>
<td valign="top" align="left">80.71&#x0025;</td>
<td valign="top" align="left">16.02&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1891</td>
<td valign="top" align="left">6,275,418</td>
<td valign="top" align="left">65.16&#x0025;</td>
<td valign="top" align="left">81.21&#x0025;</td>
<td valign="top" align="left">16.05&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1892</td>
<td valign="top" align="left">6,372,156</td>
<td valign="top" align="left">62.01&#x0025;</td>
<td valign="top" align="left">80.92&#x0025;</td>
<td valign="top" align="left">18.91&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1893</td>
<td valign="top" align="left">6,331,905</td>
<td valign="top" align="left">60.62&#x0025;</td>
<td valign="top" align="left">80.16&#x0025;</td>
<td valign="top" align="left">19.54&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1894</td>
<td valign="top" align="left">6,618,095</td>
<td valign="top" align="left">66.63&#x0025;</td>
<td valign="top" align="left">82.10&#x0025;</td>
<td valign="top" align="left">15.47&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1895</td>
<td valign="top" align="left">6,485,491</td>
<td valign="top" align="left">67.96&#x0025;</td>
<td valign="top" align="left">82.10&#x0025;</td>
<td valign="top" align="left">14.14&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1896</td>
<td valign="top" align="left">6,802,715</td>
<td valign="top" align="left">64.29&#x0025;</td>
<td valign="top" align="left">81.43&#x0025;</td>
<td valign="top" align="left">17.14&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1897</td>
<td valign="top" align="left">7,366,360</td>
<td valign="top" align="left">61.69&#x0025;</td>
<td valign="top" align="left">80.14&#x0025;</td>
<td valign="top" align="left">18.45&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">1898</td>
<td valign="top" align="left">7,113,723</td>
<td valign="top" align="left">63.87&#x0025;</td>
<td valign="top" align="left">80.50&#x0025;</td>
<td valign="top" align="left">16.63&#x0025;</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="left">109,388,752</td>
<td valign="top" align="left"></td>
<td valign="top" align="left"></td>
<td valign="top" align="left">Average 15.3&#x0025;</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Re-OCRing improves the quality of the newspaper clearly and consistently. The average improvement over the whole 30-year period is 15.3&#x0025; units; the largest improvement is 20.5&#x0025; units and the smallest 12&#x0025; units. Although morphological recognition is no guarantee of the correctness of a word, such large improvements in the recognition rate are a clear indication of quality improvement.</p>
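<p>The recognition rates in Table 7 are shares of word tokens accepted by a morphological analyzer. The sketch below illustrates the computation; the <italic>recognizes</italic> predicate and the toy lexicon are stand-ins for the actual HisOmorfi analyzer interface:</p>

```python
def recognition_rate(tokens, recognizes):
    """Share of word tokens accepted by a morphological analyzer."""
    return sum(recognizes(t) for t in tokens) / len(tokens)

# toy lexicon standing in for the HisOmorfi analyzer
lexicon = {"uusi", "suometar", "ilmestyy", "tänään"}
recognizes = lambda w: w.lower() in lexicon

old_ocr = ["uusl", "suometar", "ilmcstyy", "tänään"]
new_ocr = ["uusi", "suometar", "ilmestyy", "tänään"]

gain = (recognition_rate(new_ocr, recognizes)
        - recognition_rate(old_ocr, recognizes)) * 100  # gain in % units
```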
</sec>
</sec>
<sec id="s4">
<title>4. Conclusion</title>
<p>In this paper we have given a general description of our Optical Character Recognition GT sample of Finnish historical newspapers and journals. The data consists of 479 pages and 471,903 parallel words. It has been used in the development and evaluation of a new OCRing process for the Finnish Fraktur part of our collection, using the open source Tesseract OCR engine v. 3.04.01. According to our evaluation results, we can achieve a clear improvement in OCR quality with Tesseract on the 500K GT data (Koistinen et al., 2017, 2018). All our analyses show that the re-OCR procedure works relatively well: it does not shorten or lengthen words significantly, and it reduces the number of word types in the Tesseract OCR in comparison to the current OCR. Recognition of the produced words by morphological analyzers improves by 9&#x0025; units, and the P/R figures for the correction effect of the re-OCR are satisfactory. Of the corrections made to the words, 89&#x0025; are corrections of 1&#x2013;3 characters.</p>
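<p>The length of a correction can be measured as the Levenshtein edit distance between the old and the corrected word form (<xref ref-type="bibr" rid="r16">Levenshtein, 1966</xref>); a standard dynamic-programming sketch, for illustration rather than the evaluation code used in this study:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# typical one-edit Fraktur confusions: c/e and l/i
d1 = levenshtein("suomctar", "suometar")
d2 = levenshtein("lehtl", "lehti")
```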
<p>The GT data was created as a tool for quality control of the re-OCRing process. We have published the word lists, ALTO XML and image files of the data as open data on our web site<italic> digi.kansalliskirjasto.fi/opendata</italic>. We have earlier published the text files of the collection&#x2019;s 1771&#x2013;1910 part (<xref ref-type="bibr" rid="r21">P&#x00E4;&#x00E4;kk&#x00F6;nen, Kervinen, Nivala, Kettunen, &#x0026; M&#x00E4;kel&#x00E4;, 2016</xref>) with metadata, ALTO XML and plain text. Publication of the GT data benefits those who work on OCRing historical Finnish or who develop post-correction algorithms for OCRing. Development of general OCR tools such as Transkribus<xref ref-type="fn" rid="fn14">14</xref> may also benefit from the data. Earlier, we made the GT data available for research use on demand, and it has been used to train the Ocropy OCR engine on the historical data (<xref ref-type="bibr" rid="r5">Drobac et al., 2017</xref>).</p>
<p>An old saying in computational linguistics is that <italic>more data is better data</italic>, and this applies to OCR data too. An even larger OCR GT data set would have been welcome, but given the resources at our disposal we are content with the data now available. The data adds a useful resource to the repertoire of the somewhat under-resourced collections of 19<sup>th</sup> century Finnish. We hope the data will also find use outside the OCR and post-correction field among those who work in the digital humanities.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>This work was supported by the Academy of Finland as part of the project Computational History and Transformation of Public Discourse in Finland, 1640&#x2013;1910, decision number 293341.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="r1"><mixed-citation>Berg-Kirkpatrick, T., &#x0026; Klein, D. (2014). Improved typesetting models for historical OCR. In K. Toutanova &#x0026; H. Wu (Eds.), <italic>Proceedings of the 52nd annual meeting of the Association for Computational Linguistics</italic> (pp. 118&#x2013;123). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3115/v1/P14-2020">https://doi.org/10.3115/v1/P14-2020</ext-link>.</mixed-citation></ref>
<ref id="r2"><mixed-citation>Carrasco, R. C. (2014). An open-source OCR evaluation tool. In A. Antonacopoulos &#x0026; K. U. Schulz (Eds.), <italic>Proceedings of the first international conference on digital access to textual cultural heritage (DATeCH &#x2018;14)</italic> (pp. 179&#x2013;184). New York: ACM. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2595188.2595221">https://doi.org/10.1145/2595188.2595221</ext-link>.</mixed-citation></ref>
<ref id="r3"><mixed-citation>Choudhury, M., Thomas, M., Mukherjee, A., Basu, A., &#x0026; Ganguly, N. (2007). How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach. In C. Biemann, I. Matveeva, R. Mihalcea, &#x0026; D. Radev (Eds.), <italic>TextGraphs-2: Graph-based algorithms for natural language processing &#x2013; Proceedings of the workshop</italic> (HTL-NAACL 2007) (pp. 81&#x2013;88). New Brunswick, NL: Association for Computational Linguistics. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/physics/0703198.pdf">https://arxiv.org/pdf/physics/0703198.pdf</ext-link>.</mixed-citation></ref>
<ref id="r4"><mixed-citation>Dashti, S. M. (2018). Real-word error correction with trigrams: Correcting multiple errors in a sentence. <italic>Language Resources and Evaluation, 52</italic>, 485&#x2013;502. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10579-017-9397-4">https://doi.org/10.1007/s10579-017-9397-4</ext-link>.</mixed-citation></ref>
<ref id="r5"><mixed-citation>Drobac, S., Kauppinen, P., &#x0026; Lind&#x00E9;n, K. (2017). OCR and post-correction of historical Finnish texts. In J. Tiedemann (Ed.), <italic>NoDaLiDa, Proceedings of the 21st Nordic conference on computational linguistics</italic> (pp. 70&#x2013;76). Link&#x00F6;ping: Link&#x00F6;ping University Electronic Press. Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="https://www.aclweb.org/anthology/W17-0209.pdf">https://www.aclweb.org/anthology/W17-0209.pdf</ext-link>.</mixed-citation></ref>
<ref id="r6"><mixed-citation>Dunning, A. (2012). <italic>European newspaper survey report</italic>. Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf">http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf</ext-link>.</mixed-citation></ref>
<ref id="r7"><mixed-citation>Ghosh, K., Chakrabortya, A., Parui, S. K., &#x0026; Majumder, P. (2016). Improving information retrieval performance on OCRed text in the absence of clean text ground truth. <italic>Information Processing and Management</italic>, <italic>52</italic>(5), 873&#x2013;884. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.ipm.2016.03.006">https://doi.org/10.1016/j.ipm.2016.03.006</ext-link>.</mixed-citation></ref>
<ref id="r8"><mixed-citation>J&#x00E4;rvelin, A., Keskustalo, H., Sormunen, E., Saastamoinen, M., &#x0026; Kettunen, K. (2016). Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. <italic>Journal of the Association for Information Science and Technology</italic>,<italic> 67</italic>(12), 2928&#x2013;2946. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/asi.23379">https://doi.org/10.1002/asi.23379</ext-link>.</mixed-citation></ref>
<ref id="r9"><mixed-citation>Kettunen, K. (2016). Keep, change or delete? Setting up a low resource OCR post-correction framework for a digitized old Finnish newspaper collection. In D. Calvanese, D. De Nart, &#x0026; C. Tasso (Eds.), <italic>Digital libraries on the move (IRCDL 2015)</italic> (Vol. 612, pp. 95&#x2013;103). Communications in Computer and Information Science. Cham, CH: Springer. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-41938-1_11">https://doi.org/10.1007/978-3-319-41938-1_11</ext-link>.</mixed-citation></ref>
<ref id="r10"><mixed-citation>Kettunen, K., &#x0026; P&#x00E4;&#x00E4;kk&#x00F6;nen, T. (2016). Measuring lexical quality of a historical Finnish newspaper collection &#x2013; Analysis of garbled OCR data with basic language technology tools and means. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, &#x2026;, S. Piperidis (Eds.), <italic>Proceedings of the tenth International Conference on Language Resources and Evaluation (LREC 2016)</italic> (pp. 956&#x2013;961). Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf">http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf</ext-link>.</mixed-citation></ref>
<ref id="r11"><mixed-citation>Kettunen, K., P&#x00E4;&#x00E4;kk&#x00F6;nen, T., &#x0026; Koistinen, M. (2016). Between diachrony and synchrony: Evaluation of lexical quality of a digitized historical Finnish newspaper and journal collection with morphological analyzers. In I. Skadi&#x0146;a &#x0026; R. Rozis (Eds.), <italic>Human language technologies &#x2013; The Baltic perspective. Proceedings of the</italic> <italic>seventh international conference</italic> (Baltic HLT 2016) (pp. 122&#x2013;129). Amsterdam: IOS Press. Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://ebooks.iospress.nl/volume/human-language-technologies-the-baltic-perspective-proceedings-of-the-seventh-international-conference-baltic-hlt-2016">http://ebooks.iospress.nl/volume/human-language-technologies-the-baltic-perspective-proceedings-of-the-seventh-international-conference-baltic-hlt-2016</ext-link>.</mixed-citation></ref>
<ref id="r12"><mixed-citation>Kettunen, K., &#x0026; Koistinen, M. (2019). Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and early 20th century newspapers and journals &#x2013; Collected notes on quality improvement. In C. Navarretta, M. Agirrezabal, &#x0026; B. Maegaard (Eds.), <italic>Proceedings of the Digital Humanities in the Nordic countries 4</italic><sup>th</sup><italic> conference (DHN2019)</italic> (pp. 270&#x2013;282). Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://ceur-ws.org/Vol-2364/25_paper.pdf">http://ceur-ws.org/Vol-2364/25_paper.pdf</ext-link>.</mixed-citation></ref>
<ref id="r13"><mixed-citation>Klein, S. T., &#x0026; Kopel, M. (2002). A voting system for automatic OCR correction. In J. Callan, P. Kantor, &#x0026; D. Grossmann (Eds.), <italic>Proceedings of the SIGIR 2002 Workshop on information retrieval and OCR: From converting content to grasping meaning</italic> (n.p.). Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://boston.lti.cs.cmu.edu/callan/Workshops/IR-OCR-02/tklein.pdf">http://boston.lti.cs.cmu.edu/callan/Workshops/IR-OCR-02/tklein.pdf</ext-link>.</mixed-citation></ref>
<ref id="r14"><mixed-citation>Koistinen, M., Kettunen, K., &#x0026; P&#x00E4;&#x00E4;kk&#x00F6;nen, T. (2017). Improving optical character recognition of Finnish historical newspapers with a combination of Fraktur &#x0026; Antiqua models and image preprocessing. In J. Tiedemann (Ed.), <italic>NoDaLiDa, Proceedings of the 21st Nordic conference on computational linguistics</italic> (pp. 277&#x2013;283). Link&#x00F6;ping: Link&#x00F6;ping University Electronic Press. Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="http://www.ep.liu.se/ecp/131/038/ecp17131038.pdf">http://www.ep.liu.se/ecp/131/038/ecp17131038.pdf</ext-link>.</mixed-citation></ref>
<ref id="r15"><mixed-citation>Koistinen, M., Kettunen, K., &#x0026; Kervinen, J. (2018). Bad OCR has a nasty character &#x2013; re-OCRing historical Finnish newspaper material 1771&#x2013;1910. Submitted to <italic>International Journal on Document Analysis and Recognition</italic>.</mixed-citation></ref>
<ref id="r16"><mixed-citation>Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. <italic>Soviet Physics Doklady</italic>, <italic>10</italic>(8), 707&#x2013;710.</mixed-citation></ref>
<ref id="r17"><mixed-citation>Lopresti, D. (2009). Optical character recognition errors and their effects on natural language processing. <italic>International Journal on Document Analysis and Recognition, 12</italic>, 141&#x2013;151. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10032-009-0094-8">https://doi.org/10.1007/s10032-009-0094-8</ext-link>.</mixed-citation></ref>
<ref id="r18"><mixed-citation>M&#x00E4;kel&#x00E4;, Eetu. (2016). LAS: An integrated language analysis tool for multiple languages. <italic>The Journal of Open Source Software, 1</italic>(6):35, 1&#x2013;2. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.21105/joss.00035">https://doi.org/10.21105/joss.00035</ext-link>.</mixed-citation></ref>
<ref id="r19"><mixed-citation>Manning, C. D., &#x0026; Sch&#x00FC;tze, H. (1999). <italic>Foundations of statistical natural language processing</italic>. Cambridge, MA: The MIT Press.</mixed-citation></ref>
<ref id="r20"><mixed-citation>M&#x00E4;rgner, V., &#x0026; El Abed, H. (2014). Tools and metrics for document analysis system evaluation. In D. Doermann &#x0026; K. Tombre (Eds.), <italic>Handbook of document image processing and recognition</italic> (pp. 1011&#x2013;1036). London: Springer Verlag.</mixed-citation></ref>
<ref id="r21"><mixed-citation>P&#x00E4;&#x00E4;kk&#x00F6;nen, T., Kervinen, J., Nivala, A., Kettunen, K., &#x0026; M&#x00E4;kel&#x00E4;, E. (2016). Exporting Finnish digitized historical newspaper contents for offline use. <italic>D-Lib Magazine</italic>, <italic>22</italic>(7/8), n.p. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1045/july2016-paakkonen">https://doi.org/10.1045/july2016-paakkonen</ext-link>.</mixed-citation></ref>
<ref id="r22"><mixed-citation>Pletschacher, S., Clausner, C., &#x0026; Antonacopoulos, A. (2015). Europeana newspapers OCR workflow evaluation. In B. Co&#x00FC;asnon, V. M&#x00E4;rgner, V. Frinken, &#x0026; B. Barrett (Eds.), <italic>HIP &#x2018;15, Proceedings of the 3rd International workshop on historical document imaging and processing</italic> (pp. 39&#x2013;46). New York: ACM Digital Library. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2809544.2809554">https://doi.org/10.1145/2809544.2809554</ext-link>.</mixed-citation></ref>
<ref id="r23"><mixed-citation>Reynaert, M. (2016). OCR Post-correction evaluation of early Dutch books online &#x2013; Revisited. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, &#x2026;, S. Piperidis (Eds.), <italic>Proceedings of the tenth International Conference on Language Resources and Evaluation (LREC 2016)</italic> (pp. 967&#x2013;974). Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="https://pure.uvt.nl/ws/portalfiles/portal/14518959/LREC2016.EDBOeval.FinalSubmittedVersion.redownloaded20160318.pdf">https://pure.uvt.nl/ws/portalfiles/portal/14518959/LREC2016.EDBOeval.FinalSubmittedVersion.redownloaded20160318.pdf</ext-link>.</mixed-citation></ref>
<ref id="r24"><mixed-citation>Silfverberg, M., Kauppinen, P., &#x0026; Lind&#x00E9;n, K. (2016). Data-driven spelling correction using weighted finite-state method. In B. Jurish, A. Maletti, U. Springmann, &#x0026; K.-M. W&#x00FC;rzner, <italic>Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata</italic> (pp. 51&#x2013;59). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="https://aclweb.org/anthology/W/W16/W16-2406.pdf">https://aclweb.org/anthology/W/W16/W16-2406.pdf</ext-link>.</mixed-citation></ref>
<ref id="r25"><mixed-citation><italic>The &#x201C;State of the Art&#x201D;: A Comparative analysis of newspaper digitization to date</italic> (2015). Retrieved January 23, 2020, from <ext-link ext-link-type="uri" xlink:href="https://www.crl.edu/sites/default/files/d6/attachments/events/ICON_Report-State_of_Digitization_final.pdf">https://www.crl.edu/sites/default/files/d6/attachments/events/ICON_Report-State_of_Digitization_final.pdf</ext-link>.</mixed-citation></ref>
<ref id="r26"><mixed-citation>Tanner, S., Mu&#x00F1;oz, T., &#x0026; Ros, P. H. (2009). Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British Library&#x2019;s 19th Century Online Newspaper Archive. <italic>D-Lib Magazine, 15</italic>(8), n.p. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1045/july2009-munoz">https://doi.org/10.1045/july2009-munoz</ext-link>.</mixed-citation></ref>
<ref id="r27"><mixed-citation>Traub, M. C., Samar, T., Ossenbruggen, J. van, He, J., Vries, A. de, &#x0026; Hardman, L. (2016). Querylog-based assessment of retrievability bias in a large newspaper corpus. In J. S. Downie &#x0026; R. H. McDonald (Eds.), <italic>JCDL &#x2019;13, 13</italic><sup>th</sup><italic> ACM/IEEE-CS Joint Conference on Digital Libraries</italic> (pp. 7&#x2013;16). New York: ACM. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2910896.2910907">https://doi.org/10.1145/2910896.2910907</ext-link>.</mixed-citation></ref>
<ref id="r28"><mixed-citation>Volk, M., Furrer, L., &#x0026; Sennrich, R. (2011). Strategies for reducing and correcting OCR error. In C. Sporleder, A. van den Bosch, &#x0026; K. Zervanou (Eds.), <italic>Language technology for cultural heritage</italic> &#x2013; <italic>Selected papers from the LaTeCH workshop series</italic> (pp. 3&#x2013;22). Berlin: Springer. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-642-20227-8_1">https://doi.org/10.1007/978-3-642-20227-8_1</ext-link>.</mixed-citation></ref>
</ref-list>
<fn-group>
<fn id="fn1"><p><ext-link ext-link-type="uri" xlink:href="https://www.newseye.eu/">https://www.newseye.eu/</ext-link>.</p></fn>
<fn id="fn2"><p>This estimation is based on the period 1771&#x2013;1910. Pletschacher, Clausner and Antonacopoulos (2015) report a word accuracy of 67.5&#x0025; for the Finnish part of the Europeana newspaper collection. That estimation is based on a selection of about 132,000 pages included in the Europeana data set.</p></fn>
<fn id="fn3"><p>About half of the collection is in Swedish, the second official language of Finland and, until about 1890, the main publication language of newspapers and journals. We have not estimated the quality of the Swedish data as thoroughly as that of the Finnish data, but the quality of the Swedish data seems to be worse.</p></fn>
<fn id="fn4"><p><ext-link ext-link-type="uri" xlink:href="https://www.abbyy.com/en-eu/finereader/">https://www.abbyy.com/en-eu/finereader/</ext-link>.</p></fn>
<fn id="fn5"><p>&#x201C;In digital imaging and OCR, ground truth is the objective verification of the particular properties of a digital image, used to test the accuracy of automated image analysis processes. The ground truth of an image&#x2019;s text content, for instance, is the complete and accurate record of every character and word in the image. This can be compared to the output of an OCR engine and used to assess the engine&#x2019;s accuracy, and how important any deviation from ground truth is in that instance.&#x201D; <ext-link ext-link-type="uri" xlink:href="https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/">https://www.digitisation.eu/tools-resources/image-and-ground-truth-resources/</ext-link>. Cf. also M&#x00E4;rgner and El Abed (2014) and Carrasco, (2014).</p></fn>
<fn id="fn6"><p>This version was produced by the subcontractor when the GT data was formed.</p></fn>
<fn id="fn7"><p>The original data has 500,640 words. Parallelization of the different OCR versions of the data has proven hard, and we use the 471K subset of the data that has content for every OCR version.</p></fn>
<fn id="fn8"><p><ext-link ext-link-type="uri" xlink:href="https://sites.google.com/view/icdar2017-postcorrectionocr/dataset">https://sites.google.com/view/icdar2017-postcorrectionocr/dataset</ext-link>.</p></fn>
<fn id="fn9"><p><ext-link ext-link-type="uri" xlink:href="https://github.com/flammie/omorfi">https://github.com/flammie/omorfi</ext-link>.</p></fn>
<fn id="fn10"><p><ext-link ext-link-type="uri" xlink:href="https://github.com/jiemakel/omorfi">https://github.com/jiemakel/omorfi</ext-link>, M&#x00E4;kel&#x00E4; (2016).</p></fn>
<fn id="fn11"><p>Variation of <italic>w</italic> and <italic>v</italic> is one of the main differences between 19th-century and modern Finnish spelling. <italic>W</italic> was used much more in the 19th century; in modern Finnish it is used mostly in foreign names (e.g. <italic>Wagner</italic>).</p></fn>
<fn id="fn12"><p>Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965. See <ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/wiki/Levenshtein_distance">https://en.wikipedia.org/wiki/Levenshtein_distance</ext-link>.</p></fn>
<fn id="fn13"><p>&#x201C;It is impossible to correct very noisy texts, where the nature of the noise is random and words are distorted by a large edit distance (say 3 or more).&#x201D;</p></fn>
<fn id="fn14"><p><ext-link ext-link-type="uri" xlink:href="https://transkribus.eu/Transkribus/">https://transkribus.eu/Transkribus/</ext-link>.</p></fn>
</fn-group>
</back>
</article>


