<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article article-type="research-article" xml:lang="EN" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">LIBER</journal-id>
<journal-title-group>
<journal-title>LIBER QUARTERLY</journal-title>
</journal-title-group>
<issn pub-type="epub">2213-056X</issn>
<publisher>
<publisher-name>openjournals.nl</publisher-name>
<publisher-loc>The Hague, The Netherlands</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">lq.13330</article-id>
<article-id pub-id-type="doi">10.53377/lq.13330</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Nautilus &#x2013; An End-To-End METS/ALTO OCR Enhancement Pipeline</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-9034-1551</contrib-id>
<name>
<surname>Schneider</surname>
<given-names>Pit</given-names>
</name>
<xref ref-type="aff" rid="aff1"/>
<email>pit.schneider@bnl.etat.lu</email>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1862-1943</contrib-id>
<name>
<surname>Maurer</surname>
<given-names>Yves</given-names>
</name>
<xref ref-type="aff" rid="aff1"/>
<email>yves.maurer@bnl.etat.lu</email>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-8255-1786</contrib-id>
<name>
<surname>Marschall</surname>
<given-names>Ralph</given-names>
</name>
<xref ref-type="aff" rid="aff1"/>
<email>ralph.marschall@bnl.etat.lu</email>
</contrib>
<aff id="aff1">Department of IT and Digital Innovation, National Library of Luxembourg, Luxembourg</aff>
</contrib-group>
<pub-date pub-type="epub">
<month>03</month>
<year>2023</year>
</pub-date>
<volume>33</volume>
<fpage>1</fpage>
<lpage>19</lpage>
<permissions>
<copyright-statement>Copyright 2023, The copyright of this article remains with the author</copyright-statement>
<copyright-year>2023</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://www.liberquarterly.eu/article/10.53377/lq.13330"/>
<abstract>
<p>When a digital collection has been processed by OCR, the usability expectations of patrons and researchers are high. While the former expect full text search to return all instances of terms in historical collections correctly, the latter are more familiar with the impacts of OCR errors but would still like to apply big data analysis or machine-learning methods. All of these use cases depend on high quality textual transcriptions of the scans. This is why the National Library of Luxembourg (BnL) has developed a pipeline to improve OCR for existing digitised documents. Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. The BnL has open-sourced it so that other libraries can re-use it on their own collections. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results. Based on a quality prediction measure, developed during the project, approximately 28 million additional text lines now exceed the quality threshold.</p>
</abstract>
<kwd-group>
<kwd>OCR quality</kwd>
<kwd>OCR correction</kwd>
<kwd>METS/ALTO</kwd>
<kwd>Luxembourg historical newspapers</kwd>
<kwd>ground truth</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1. Introduction</title>
<p>The National Library of Luxembourg (BnL) has been digitising its newspaper heritage collections since 2005 in the METS/ALTO format with OCR processing and manual zoning of individual articles. The results are available on the <ext-link ext-link-type="uri" xlink:href="https://eluxemburgensia.lu/">eluxemburgensia</ext-link><sup><xref ref-type="fn" rid="fn1"><sup>1</sup></xref></sup> platform, which provides full-text search, displays the transcribed text alongside high-resolution images and can highlight search results on the image (<xref ref-type="fig" rid="fg001">Figure 1</xref>) using coordinates from the ALTO XML files.</p>
<fig id="fg001">
<label>Fig. 1:</label>
<caption><p>Example word-highlights, with one highlight spanning across two text lines.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig1.jpg"/></fig>
<p>The BnL has never had in-house scanning facilities, so external suppliers were hired through public tenders to perform the scanning, the computer-assisted zoning into articles, the assignment of article types (e.g. ARTICLE, DEATH_NOTICE or ADVERTISEMENT) and the production of METS/ALTO. While the manual parts of the digitisation chain, such as handling, scanning and zoning, were thoroughly checked by the BnL&#x2019;s quality assurance team, this was not done for the raw OCR transcription of the articles. It follows that the quality of the textual representation varies hugely, depending on the different OCR engines used at the time. The idea was that OCR could be improved in an automated way while keeping the benefit of the manual work.</p>
<p>The newspaper corpus that has been digitised is multilingual (German, French, Luxembourgish and some other languages), uses a variety of typefaces (Antiqua, Fraktur), and mixes all of these elements even on the same page. As shown in <xref ref-type="fig" rid="fg002">Figure 2</xref>, it was challenging for the suppliers to always use the correct OCR engine for each block of text.</p>
<fig id="fg002">
<label>Fig. 2:</label>
<caption><p>Example block where OCR (output at left) was wrongly performed using ABBYY FineReader Engine 10 for French (Antiqua) (source image at right).</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig2.jpg"/></fig>
<p>The idea of improvement was validated in a pilot project using the Tesseract (<xref ref-type="bibr" rid="r15">Smith, 2007</xref>) software, during which a dictionary-based metric was used to ensure that the new OCR was not worse than the supplier&#x2019;s original output. However, it was noted that &#x201C;<italic>Tesseract did not perform consistently better than the original OCR, as we had expected. In fact it performed slightly worse in an overall comparison</italic>&#x201D; (<xref ref-type="bibr" rid="r10">Maurer, 2017</xref>), for both Antiqua and Fraktur fonts, so a new effort was started in 2020 in which all the individual components were tested again and improved as necessary. This led to the development of <italic>Nautilus</italic>, an end-to-end METS/ALTO OCR enhancement pipeline applied and released as <ext-link ext-link-type="uri" xlink:href="https://github.com/natliblux/nautilusocr">open source</ext-link><sup><xref ref-type="fn" rid="fn2"><sup>2</sup></xref></sup> by the BnL.</p>
<p>The rest of this article is structured as follows: Section 2 presents the creation of a ground truth dataset. The main software pipeline is then laid out in Section 3, followed by Section 4, which is dedicated to the results of its first application. Finally, Section 5 concludes the paper.</p>
</sec>
<sec id="s2">
<title>2. Ground Truth Generation</title>
<p>Nowadays, machine learning models are at the core of OCR, and their accuracy has a direct influence on the recognition quality for characters and words. As suggested by similar projects, such as <xref ref-type="bibr" rid="r6">Kettunen et al. (2020)</xref>, training on BnL data would produce the best results with Nautilus, for any underlying OCR engine that we would end up choosing. For that reason, the first project step involved deriving <italic>ground truth</italic> from the BnL&#x2019;s digitised documents. To be able to generate such a dataset, we were faced with decisions regarding the data sampling strategy, transcription guidelines and quality assurance.</p>
<sec id="s2a">
<title>2.1. Data Sampling</title>
<p>During data sampling, we followed the objective of striking a balance between a diverse and a representative ground truth set. We tried to reflect the diversity of the corpus by balancing the following properties:</p>
<list list-type="bullet"><list-item><p>Language (German, French, Luxembourgish)</p></list-item>
<list-item><p>Newspaper title (52 distinct titles)</p></list-item>
<list-item><p>Publication date (1841&#x2013;1954)</p></list-item>
</list>
<p>The final selection includes 6723 text blocks, ranging in length from 1 to 1054 transcribed words. The public domain part of the ground truth set has been published on the BnL&#x2019;s open data <ext-link ext-link-type="uri" xlink:href="https://data.bnl.lu/data/historical-newspapers/">platform</ext-link>.<sup><xref ref-type="fn" rid="fn3"><sup>3</sup></xref></sup></p>
</sec> 
<sec id="s2b">
<title>2.2. Transcription Guidelines</title>
<p>With financing made available by <italic>The AI4Gov initiative</italic> of the government of Luxembourg (<xref ref-type="bibr" rid="r18">The Luxembourg Government, n.d.</xref>), an external supplier was tasked with the transcription of these blocks. To guarantee its smooth execution, we needed a set of precise technical specifications. Those were fundamentally inspired by the OCR-D (<xref ref-type="bibr" rid="r11">Neudecker et al., 2019</xref>) project, but we ended up relaxing many requirements to make them easier to work with, which also means that the result is not fully compatible with ground truth generated according to the actual OCR-D guidelines.</p>
<p>During the execution of the project, we collaborated closely with the supplier to address unexpected cases and questions. For example, the transcription of upside-down (<xref ref-type="fig" rid="fg003">Figure 3</xref>), horizontally reversed, invisible or artwork characters needed to be defined once the first such cases were discovered.</p>
<fig id="fg003">
<label>Fig. 3:</label>
<caption><p>Unexpected upside down &#x201C;r&#x201D; (2nd line, last character).</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig3.jpg"/></fig>
<p>In general, we defined the transcription process itself to follow three steps for every given block:</p>
<list list-type="order"><list-item><p>Identifying the text lines within the block.</p></list-item>
<list-item><p>Identifying the characters within each text line.</p></list-item>
<list-item><p>Taking the existing ALTO file and updating lines, words and coordinates (works for ALTO 1.0 and above).</p></list-item>
</list>
<p>Two decisions, tied to the first two steps, were taken to potentially allow the training of more robust character recognition models:</p>
<list list-type="bullet"><list-item><p>Ground truth text line bounding boxes were not allowed to be skewed (i.e. they were forced to be perfectly horizontal with respect to the block), meaning that the text line itself could still be slightly skewed within its box.</p></list-item>
<list-item><p>Every single character transcription was assigned a confidence value, based on three discrete classes. This made it possible to include or exclude characters that were harder for the transcribers to recognise (e.g. because of smudged print) in the model training process.</p></list-item>
</list>
</sec>
<sec id="s2c">
<title>2.3. Quality Assurance</title>
<p>To validate the ground truth quality, a global accuracy requirement of 99.95% was set. To enforce this target, we checked a small subset of blocks using a combination of custom software and manual verification. The software was designed to perform as many automatic checks as possible, such as the filtering of non-whitelisted characters or the detection of overlapping line bounding boxes (potentially indicating an error).</p>
<p>Returned data batches generally featured a couple of smaller quality issues, which were promptly rectified in subsequent batch iterations. These problems included, among others, invalid ALTO files, wrong word bounding boxes and wrong numbers of confidence values. In the end, after five months of preparing, specifying and transcribing the data, the ground truth project concluded and the desired accuracy target was met.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Nautilus</title>
<p>We now turn to the Nautilus pipeline itself. The software encapsulates an end-to-end workflow for enhancing the OCR quality of existing METS/ALTO data.</p>
<sec id="s3a">
<title>3.1. Block Definition</title>
<p>To proceed with the specifics of the pipeline, we first require a more precise definition of a <italic>block</italic>. A block generally represents an image containing text of an individual paragraph or even a small article (ALTO <italic>TextBlock</italic> tag). In terms of layout, a block is always contained within a single text column. We refer to a block image (e.g. <xref ref-type="fig" rid="fg004">Figure 4</xref>), identified through index i, as B<sub>i</sub>. Similarly, a new Nautilus output (derived from B<sub>i</sub>) is denoted as B<sub>i</sub><sup>new</sup>, the <italic>original</italic> OCR text as B<sub>i</sub><sup>ori</sup> and the <italic>ground truth</italic> counterpart as B<sub>i</sub><sup>gt</sup>.</p>
<fig id="fg004">
<label>Fig. 4:</label>
<caption><p>Small sample block used to demonstrate the pipeline, in the following referred to as B<sub>s</sub>.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig4.jpg"/></fig>
<p>Because the pipeline targets existing blocks and does not change their coordinates in the output file, the METS logical <italic>structMap</italic> does not need to be changed. This simplifies the integration of the resulting improvements into the METS/ALTO package.</p>
</sec>
<sec id="s3b">
<title>3.2. METS/ALTO Pipeline</title>
<p>The six steps seen in <xref ref-type="fig" rid="fg005">Figure 5</xref> can be detailed as follows:</p>
<fig id="fg005">
<label>Fig. 5:</label>
<caption><p>METS/ALTO pipeline overview.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig5.jpg"/></fig>
<list list-type="order">
	<list-item><p><bold>Input</bold></p>
<p>A METS/ALTO package is expected as input. This means that, next to the METS file, every scanned page of the target document has to be represented as one ALTO file (containing the original OCR) and one archive image file.</p></list-item>
<list-item><p><bold>Extraction</bold></p>
<p>Through parsing of the METS/ALTO files, every B<sub>i</sub><sup>ori</sup> is extracted and paired with the corresponding B<sub>i</sub>, which is obtained by cropping the respective page image.</p></list-item>
<list-item><p><bold>Targeting</bold></p>
<p>Not every pair represents an enhancement target, however. The pipeline allows control over the type of blocks that are subject to reprocessing (e.g. blocks of type <italic>DEATH_NOTICE</italic> and <italic>ADVERTISEMENT</italic> were discarded because their layout worked poorly with Nautilus).</p></list-item>
<list-item><p><bold>OCR</bold></p>
<p>The OCR step can be seen as a meta-pipeline, to which every target pair is fed. It forms the basis for a potential conversion to B<sub>i</sub><sup>new</sup>; alternatively, the unmodified B<sub>i</sub><sup>ori</sup> is retained. Further elaborations on the OCR pipeline follow in the next subsections.</p></list-item>
<list-item><p><bold>Integration</bold></p>
<p>Next, every B<sub>i</sub><sup>new</sup> remains to be integrated into an updated METS/ALTO package. This concretely demands the update of every targeted <italic>TextBlock</italic> subtree within the respective ALTO file. Finally, the METS file is modified to incorporate the new ALTO checksums, file sizes and creation dates.</p></list-item>
<list-item><p><bold>Output</bold></p>
<p>To close the loop, Nautilus outputs the input METS/ALTO package, ideally with improved OCR quality.</p></list-item>
</list>
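<p>The extraction step above can be sketched in a few lines of Python. This is a simplified illustration using only the standard library: the ALTO snippet omits the XML namespace that real ALTO files declare, while the <italic>HPOS</italic>, <italic>VPOS</italic>, <italic>WIDTH</italic> and <italic>HEIGHT</italic> attributes follow the actual ALTO schema.</p>

```python
import xml.etree.ElementTree as ET

# Minimal ALTO snippet (namespace omitted for brevity; real ALTO files
# declare one, e.g. http://www.loc.gov/standards/alto/ns-v2#).
ALTO = """
<alto>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="120" VPOS="340" WIDTH="800" HEIGHT="260">
          <TextLine>
            <String CONTENT="Grasen" HPOS="130" VPOS="350" WIDTH="90" HEIGHT="30"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

def extract_blocks(alto_xml):
    """Pair every TextBlock's original OCR text with its crop box."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for tb in root.iter("TextBlock"):
        box = tuple(int(tb.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        text = " ".join(s.get("CONTENT") for s in tb.iter("String"))
        blocks.append({"id": tb.get("ID"), "box": box, "ori": text})
    return blocks

pairs = extract_blocks(ALTO)
print(pairs[0]["box"])  # (120, 340, 800, 260)
print(pairs[0]["ori"])  # Grasen
```

<p>The returned box would then be used to crop B<sub>i</sub> from the corresponding page image, e.g. with Pillow&#x2019;s <italic>Image.crop</italic>.</p>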
</sec>
<sec id="s3c">
<title>3.3. OCR Pipeline</title>
<p>Diving in a bit deeper: between the input block and the output text sit six distinct software components (<xref ref-type="fig" rid="fg006">Figure 6</xref>), which together form the OCR part of the pipeline.</p>
<fig id="fg006">
<label>Fig. 6:</label>
<caption><p>Sequentially structured OCR pipeline.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig6.jpg"/></fig>
<sec id="s3c1">
<title>3.3.1. Enhancement Prediction</title>
<p>Using a regression model trained on block features, denoted as <italic>Enhance</italic>, this component aims to predict the amount of improvement in OCR quality when retaining B<sub>i</sub><sup>new</sup> instead of B<sub>i</sub><sup>ori</sup>.</p>
<p>The regression model operates on a quality measure <italic>q</italic>, which is defined for B<sub>i</sub><sup>ocr</sup> as</p>
<p>&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;q(B<sub>i</sub><sup>ocr</sup>) = 1 - min(|B<sub>i</sub><sup>ocr</sup>|, edit(B<sub>i</sub><sup>ocr</sup>, B<sub>i</sub><sup>gt</sup>)) / |B<sub>i</sub><sup>ocr</sup>|</p>
<p>with</p>
<list list-type="bullet">
<list-item><p>the cardinality operator returning the number of characters in the block (including whitespaces).</p></list-item>
<list-item><p><italic>edit</italic> being the popular <xref ref-type="bibr" rid="r8">Levenshtein (1965)</xref> distance.</p></list-item>
</list>
<p>It follows that the difference q(B<sub>i</sub><sup>new</sup>) - q(B<sub>i</sub><sup>ori</sup>), predicted by Enhance, ranges from &#x2013;1 to 1.</p>
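<p>A direct transcription of the quality measure into code (a minimal sketch with a standard dynamic-programming implementation of the Levenshtein distance, assuming a non-empty block):</p>

```python
def edit(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def q(ocr, gt):
    """Quality of an OCR block relative to its ground truth counterpart."""
    n = len(ocr)  # cardinality: characters including whitespace
    return 1 - min(n, edit(ocr, gt)) / n

print(q("Grafen", "Grafen"))  # 1.0, a perfect transcription
print(q("Grasen", "Grafen"))  # one substitution in six characters: 1 - 1/6
```

<p>Clamping the edit distance with the min operator guarantees that q never drops below 0, even for a transcription that shares no characters with the ground truth.</p>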
<p>Any threshold value (denoted as <italic>&#x03B8;</italic>) can subsequently be applied to fork the pipeline and terminate right away, should Enhance(B<sub>i</sub><sup>ori</sup>) be lower than desired. In that case, based on conducted experiments, the processing time of the pipeline generally stays below 5% of the time needed for a regular run (using all six components). This reduction in processing cost is joined by two other motivations for enhancement prediction, namely:</p>
<list list-type="bullet"><list-item><p>Reduction of the risk to degrade the quality of some blocks.</p></list-item>
<list-item><p>Access to estimated improvement statistics.</p></list-item>
</list>
<p>For more detailed explanations, a dedicated article (<xref ref-type="bibr" rid="r14">Schneider &#x0026; Maurer, 2022</xref>) has been compiled, discussing all our enhancement prediction findings.</p>
<p>Leveraging B<sub>s</sub><sup>ori</sup> from Section 3.1, which contains a couple of unrecognised characters (e.g. <italic>Grasen</italic> instead of <italic>Grafen</italic>), a considerable enhancement prediction comes with Enhance(B<sub>s</sub><sup>ori</sup>) = 0.067.</p>
</sec>
<sec id="s3c2">
<title>3.3.2. Binarisation</title>
<p>Next follows binarisation, which has four major tasks:</p>
<list list-type="bullet"><list-item><p>Transforming B<sub>i</sub> into a binary image.</p></list-item>
<list-item><p>Dilating the image using a 2 &#x00D7; 2 pixel structuring element to potentially repair broken characters.</p></list-item>
<list-item><p>Padding the image so that the block has a leading and trailing white margin.</p></list-item>
<list-item><p>Possibly inverting an image containing light text on a dark background.</p></list-item>
</list>
<p>The final implementation of the binarisation component is largely based on <italic>OpenCV</italic> (<xref ref-type="bibr" rid="r1">Bradski, 2000</xref>) functions. The result of its application on B<sub>s</sub> can be seen in <xref ref-type="fig" rid="fg007">Figure 7</xref>.</p>
<fig id="fg007">
<label>Fig. 7:</label>
<caption><p>Binarised version of B<sub>s</sub>.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig7.jpg"/></fig>
</sec>
<sec id="s3c3">
<title>3.3.3. Text Line Segmentation</title>
<p>Segmenting the binary image into individual text lines is a required step for the subsequent font class and character recognition components. Our efforts regarding this step have been comprehensively documented by <xref ref-type="bibr" rid="r13">Schneider (2021)</xref>, describing the development of our own segmentation algorithm. Named <italic>CombiSeg</italic>, the method leverages a combination of morphological image operations and horizontal histogram projections to return a set of text line bounding boxes (<xref ref-type="fig" rid="fg008">Figure 8</xref>). OpenCV has once again been used to support this implementation.</p>
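<p>The horizontal histogram projection at the heart of such an approach can be illustrated with numpy (a simplified sketch, not CombiSeg itself, which additionally relies on morphological operations and tuned parameters):</p>

```python
import numpy as np

def project_lines(binary, min_height=2):
    """Find text line row-spans via a horizontal ink-histogram projection.

    `binary` is a 2D array where ink pixels are 1 and background is 0.
    Returns a list of (top, bottom) row indices, one per detected line.
    """
    ink_per_row = binary.sum(axis=1)       # horizontal projection profile
    has_ink = ink_per_row > 0
    lines, start = [], None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row                    # a new line begins
        elif not ink and start is not None:
            if row - start >= min_height:  # discard tiny noise runs
                lines.append((start, row - 1))
            start = None
    if start is not None:
        lines.append((start, len(has_ink) - 1))
    return lines

page = np.zeros((12, 20), int)
page[1:4, 2:18] = 1   # first text line
page[6:9, 2:18] = 1   # second text line
print(project_lines(page))  # [(1, 3), (6, 8)]
```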
<fig id="fg008">
<label>Fig. 8:</label>
<caption><p>Visualisation of the text line bounding boxes for B<sub>s</sub>.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig8.jpg"/></fig>
<p>The decision to develop and adopt a new solution was driven by two factors:</p>
<list list-type="bullet"><list-item><p>Being a fast algorithm (<xref ref-type="bibr" rid="r13">Schneider, 2021</xref>), CombiSeg is a major contributor to the overall efficiency of Nautilus. Building a fast pipeline that can be applied to a large volume of data in a relatively short time frame was one of the main objectives.</p></list-item>
<list-item><p>CombiSeg utilises some parameters that we were able to tune based on our own data. This level of adaptation is usually not possible with an out-of-the-box solution.</p></list-item>
</list></sec>
<sec id="s3c4">
<title>3.3.4. Font Class Recognition</title>
<p>Based on the diverse nature of our data and some preliminary tests, targeting individual font classes seemed to be a key ingredient for improvements. That is why the pipeline forks on two font classes before arriving at the main character recognition component. The basis for this is a convolutional neural network that classifies Antiqua and Fraktur fonts after having been trained on a set of individual character images for each class.</p>
<p>Depicted in <xref ref-type="fig" rid="fg009">Figure 9</xref>, a small preprocessing pipeline is responsible for the conversion of a segmented binary image into a set of individual characters (chars).</p>
<fig id="fg009">
<label>Fig. 9:</label>
<caption><p>Font class recognition preprocessing pipeline.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig9.jpg"/></fig>
<list list-type="order">
<list-item><p><bold>Isolate Characters</bold> Using the text line bounding box information obtained from the previous component, we isolate individual characters within every line. This is done by deriving a list of connected components from binary B<sub>i</sub>.</p></list-item>
<list-item><p><bold>Select Characters</bold> The selection of a character subset follows a strategy that, first, prioritises the initial characters of words, which have a higher chance of being capital letters (a bigger visual difference between font classes). Second, the first character of every line (except for the first line) is discarded, since it is more likely to represent a digit (e.g. in enumerations).</p></list-item>
<list-item><p><bold>Crop Characters</bold> Next, every character is cropped from binary B<sub>i</sub> and stored as an individual image.</p></list-item>
<list-item><p><bold>Clean Characters</bold> Possible adjacent characters, identified through the connected components map, are removed from every character image, in an attempt to strengthen the isolation process.</p></list-item>
<list-item><p><bold>Scale Characters</bold> Finally, scaling is applied so that every character image is of expected dimension 32 &#x00D7; 32 and can be processed by the neural network.</p></list-item>
</list>
<p>The convolutional neural network itself consists of two <italic>convolutional</italic> layers, each followed by <italic>max pooling</italic>, and two <italic>fully connected</italic> layers with a <italic>dropout</italic> layer in between.</p>
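<p>Such a network can be sketched in PyTorch (the layer types and their order follow the description above, while the channel counts, kernel sizes and dropout rate are illustrative assumptions):</p>

```python
import torch
import torch.nn as nn

class FontClassNet(nn.Module):
    """Two conv layers (each followed by max pooling), then two fully
    connected layers separated by dropout; input is a 32x32 character."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6, 128), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 2),  # two classes: Antiqua, Fraktur
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = FontClassNet().eval()
logits = net(torch.zeros(1, 1, 32, 32))  # one 32x32 character image
print(logits.shape)                      # torch.Size([1, 2])
```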
<p>Majority voting over the classifications of at most 15 preprocessed character images (<xref ref-type="fig" rid="fg010">Figure 10</xref>) ultimately decides on the font class.</p>
<fig id="fg010">
<label>Fig. 10:</label>
<caption><p>All 15 character images extracted from B<sub>s</sub>, with every single one being classified as Fraktur.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig10.jpg"/></fig>
</sec>
<sec id="s3c5">
<title>3.3.5. Character Recognition</title>
<p>The most crucial component, character recognition, is implemented using the <italic>Kraken</italic> software (<xref ref-type="bibr" rid="r7">Kiessling, 2019</xref>), which has been forked from the <italic>OCRopus OCR System</italic> (<xref ref-type="bibr" rid="r2">Breuel, 2008</xref>). We integrated this library because it is rather easy to use and, more importantly, because of its ability to return geometric information about the location of characters within B<sub>i</sub>, an obvious requirement for outputting the ALTO format.</p>
<p>Character recognition using Kraken is done by providing:</p>
<list list-type="order">
<list-item><p>Binary B<sub>i</sub>.</p></list-item>
<list-item><p>The text line segmentation information.</p></list-item>
<list-item><p>The font class (determining the Kraken model that is being applied).</p></list-item>
</list>
<p>The Kraken models are trained using the default network architecture and early stopping. <xref ref-type="fig" rid="fg011">Figure 11</xref> shows the recognition output for B<sub>s</sub>.</p>
<fig id="fg011">
<label>Fig. 11:</label>
<caption><p>B<sub>s</sub> and B<sub>i</sub><sup>new</sup> visualised side-by-side.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig11.jpg"/></fig>
</sec>
<sec id="s3c6">
<title>3.3.6. ALTO Generation</title>
<p>Finally, ALTO generation is implemented by adding some logic that refines the character positions returned by Kraken, in order to obtain a final set of word bounding boxes (<xref ref-type="fig" rid="fg012">Figure 12</xref>). Every character recognition confidence score <italic>conf</italic>, falling in the range of 0 to 1, is remapped to <italic>CONF</italic>, such that</p>
<fig id="fg012">
<label>Fig. 12:</label>
<caption><p>ALTO snippet of first TextLine of B<sub>s</sub><sup>new</sup>.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig12.jpg"/></fig>
<p>&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;&#x2003;CONF = round(9 - conf*9)</p>
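<p>Assuming that <italic>CONF</italic> is meant to populate ALTO&#x2019;s <italic>CC</italic> (character confidence) attribute, this remapping also inverts the scale, since there 0 denotes the most and 9 the least confident character. A one-line sketch:</p>

```python
def remap(conf):
    """Remap a recognition confidence in [0, 1] to a 0-9 scale where
    0 is the most and 9 the least confident (as in ALTO's CC attribute)."""
    return round(9 - conf * 9)

print(remap(1.0))  # 0 (fully confident)
print(remap(0.0))  # 9 (no confidence)
print(remap(0.9))  # 1
```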
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Results</title>
<p>The necessary testing results, required to bring the newly developed enhancement tool to production, were attained in early 2021. At the time, the enhancement prediction component was still based on a binary method, classifying the quality of B<sub>i</sub><sup>ori</sup> (<italic>Quality</italic> in <xref ref-type="bibr" rid="r14">Schneider and Maurer (2022)</xref>).</p>
<sec id="s4a">
<title>4.1. Testing</title>
<p>Applied on a split-off ground truth test set, the following observations were made, using threshold &#x03B8; &#x003D; 0.95 for the quality measure q(B<sub>i</sub><sup>ocr</sup>) defined in Section 3.3.1:</p>
<list list-type="bullet"><list-item><p>67% of the blocks were predicted to already exceed &#x03B8;.</p></list-item>
<list-item><p>Among the remaining 33% (reprocessing targets), 91% were improved in such a way that a subsequent quality prediction on B<sub>i</sub><sup>new</sup> exceeded &#x03B8;. By leveraging B<sub>i</sub><sup>gt</sup>, the average enhancement per block could be reviewed and was found to be 0.097 (in contrast to 0.04 for all blocks).</p></list-item>
</list>
<p>Another reassurance was the font classification accuracy of 99%, which proved particularly valuable for enhancing Fraktur-based blocks.</p>
</sec>
<sec id="s4b">
<title>4.2. Batch Processing</title>
<p>We subsequently applied Nautilus to a little more than 100 thousand newspapers, containing just above 475 thousand pages (10 TB of TIF images), 8 million blocks and 175 million text lines. Processing time was reduced by splitting the corpus into 8 equally sized batches, which were fed to 8 instances of the software running in parallel. Based on the reprocessing rate of 33% observed during testing, we decided to retain the same &#x03B8; &#x003D; 0.95 threshold for this first application.</p>
<p>All the computations, which took nearly 15 days, were done on the same Linux-based virtual machine that was used for the training of the models. Thanks to the support of the <xref ref-type="bibr" rid="r18"><italic>Luxembourg Government</italic> (n.d.)</xref>, we were able to use a machine with 8 CPU cores, a <italic>V100D-16C NVIDIA</italic> GPU and 48 GB of RAM. The entire dataset was stored on and loaded from a Network File System.</p>
<p>Once the processing concluded, the data was re-ingested into the eluxemburgensia system, re-indexed into the <italic>Solr</italic> (<ext-link ext-link-type="uri" xlink:href="https://solr.apache.org/">https://solr.apache.org/</ext-link>) search engine and made available for our patrons.</p>
</sec>
<sec id="s4c">
<title>4.3. Evaluation</title>
<p>Although the absence of ground truth counterparts made a precise assessment infeasible this time, we were able to strongly tie the evaluation of the first Nautilus run to Quality, in addition to a vocabulary growth (<italic>VG</italic>) method proposed by <xref ref-type="bibr" rid="r10">Maurer (2017)</xref>.</p>
<p>To paint a more representative picture, the final metrics were remodelled to reflect the number of text lines (blocks vary heavily in size). Considering only the 23% of text lines (85% of which are Fraktur) that were ultimately reprocessed because the quality prediction for B<sub>i</sub><sup>ori</sup> fell below the threshold, Quality now estimates that:</p>
<list list-type="bullet">
<list-item><p>The fraction of lines exceeding the threshold equals 70% (Fraktur lines only: 75%).</p></list-item>
<list-item><p>An additional 15% of words can be found in a dictionary of the same language (Fraktur lines only: 18%).</p></list-item>
</list>
<p>Those numbers translate to an estimated additional 28 million lines (16% of the total 175 million) that are predicted to meet the desired quality standard.</p>
<p>The VG method proposed by <xref ref-type="bibr" rid="r19">Van de Camp (2008)</xref> and adapted by <xref ref-type="bibr" rid="r10">Maurer (2017)</xref> counts the number of unique words per million words. OCR errors introduce a lot of variations for a single correct word, so the VG measure is low when the OCR quality is high. That is why the vertical axis in <xref ref-type="fig" rid="fg013">Figure 13</xref> is upside down. As can be seen, the new OCR is consistently better or at least as good, according to this measure, but the improvements are uneven over the years and tend to be better for the older period. This can be explained by the fact that Nautilus carefully targets Fraktur and Antiqua in a way that the original supplier did not, and the period after the 2<sup>nd</sup> of March 1942 contains no more Fraktur, as explained in <xref ref-type="bibr" rid="r9">Luxemburger Wort (1942)</xref>.</p>
<fig id="fg013">
<label>Fig. 13:</label>
<caption><p>Comparison of the Vocabulary Growth measure for the newspaper title &#x201C;Luxemburger Wort&#x201D; (1849&#x2013;1950) before and after applying Nautilus.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="figures/LIBER_2023_33_Schneider_fig13.jpg"/></fig>
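<p>The VG measure can be sketched as follows. This is a minimal illustration rather than the implementation of the cited works, which count unique words over million-word windows; here the type count is simply scaled linearly so that short example texts can be compared:</p>
<preformat>
```python
import re

def vocabulary_growth(text, window=1_000_000):
    """Unique word forms per million running words (VG).

    OCR noise spawns many spurious variants of each correct word,
    so a HIGHER value indicates LOWER OCR quality.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    # Linear scaling of the type count to a one-million-word window
    # (a simplification; real corpora are measured window by window).
    return len(set(words)) * (window / len(words))

clean = "the quick brown fox " * 10
noisy = "the qu1ck brovvn f0x tne quiok brown fox thc qulck br0wn fox " * 3
# The noisy OCR-like text yields a larger VG value than the clean text.
print(vocabulary_growth(noisy) > vocabulary_growth(clean))
```
</preformat>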
</sec>
<sec id="s4d">
<title>4.4. Comparison with Tesseract</title>
<p>Since our first tests were carried out with Tesseract, we also wanted to evaluate the performance of our pipeline against it, using version 5.0.0-alpha-815-g5761. For this comparison, we applied a tool by <xref ref-type="bibr" rid="r3">Carrasco (2014)</xref> to the ground truth and to the Tesseract and Nautilus outputs, since it produces a measure that is also used in other papers, such as <xref ref-type="bibr" rid="r5">Hegghammer (2022)</xref>. We found:</p>
<list list-type="bullet">
<list-item><p>A character error rate of 3.68 and a word error rate of 11.82 for Nautilus.</p></list-item>
<list-item><p>A character error rate of 5.05 and a word error rate of 16.57 for Tesseract.</p></list-item>
</list>
<p>Tesseract was used with the <italic>Fraktur</italic>, <italic>deu</italic>, <italic>ltz</italic> and <italic>fra</italic> models as appropriate, and the resulting text was post-processed to conform to the same guidelines as used in the ground truth (e.g. no long &#x201C;s&#x201D;, normalised dashes, etc.).</p>
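<p>The error rates reported above can be illustrated with a short sketch based on the edit distance of <xref ref-type="bibr" rid="r8">Levenshtein (1965)</xref>; note that the evaluation tool by <xref ref-type="bibr" rid="r3">Carrasco (2014)</xref> additionally performs alignment and normalisation steps that this simplification omits:</p>
<preformat>
```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def error_rate(hypothesis, reference):
    """Edit distance normalised by the reference length."""
    return levenshtein(hypothesis, reference) / len(reference)

reference = "Luxemburger Wort"
hypothesis = "Luxemburqer W0rt"          # two character-level OCR errors
cer = error_rate(hypothesis, reference)  # character error rate
wer = error_rate(hypothesis.split(), reference.split())  # word error rate

print(f"CER {cer:.3f}, WER {wer:.2f}")   # here both words contain an error
```
</preformat>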
</sec>
</sec>
<sec id="s5">
<title>5. Conclusion</title>
<p>With the development of an open-source software pipeline and the generation of ground truth data, the BnL&#x2019;s objectives for improving OCR have been met. In addition to producing promising results in its first run, Nautilus can serve as the basis for future iterations, in which individual pipeline components will have to be improved or extended and new components added.</p>
<p>Image processing is currently applied only in a very limited manner, in the binarisation component. However, dewarping, rotation and cleaning are interesting treatments from which low-quality source images could certainly benefit. Moreover, it is conceivable that the current text block level requirement could be relaxed in the future by introducing layout analysis as a precursor to Nautilus.</p>
<p>In terms of component replacement, tests could be done on the generated ground truth with OCR engines other than Kraken, such as Tesseract used by <xref ref-type="bibr" rid="r6">Kettunen et al. (2020)</xref>. Both the text line segmentation algorithm and the enhancement prediction could also be further refined.</p>
<p>Another area of interest that could lead to improved accuracy is postprocessing. Here, promising approaches certainly come in the form of large language models and embeddings as in <xref ref-type="bibr" rid="r12">Nguyen et al. (2020)</xref> and in <xref ref-type="bibr" rid="r17">Soper et al. (2021)</xref>.</p>
<p>With the improvement of an estimated 28 million text lines, the project of correcting OCR of the BnL collections has been a success. The assumption that OCR can be corrected at scale through automated processes has been confirmed. It remains to be seen what conclusions the BnL can draw from this in terms of new digitisation projects. One option is to update the tender requirements and enforce a minimum OCR level, whose quality could be assured by running batches through the enhancement prediction (<xref ref-type="bibr" rid="r14">Schneider &#x0026; Maurer, 2022</xref>) to estimate whether Nautilus could still improve them. Alternatively, the specifications could remain unchanged since the BnL could run Nautilus automatically on newly digitised documents. In both cases, the BnL will be able to provide better services to the patrons and researchers.</p>
</sec>
</body>
<back>
<fn-group>
<title>Notes</title>
<fn id="fn1"><p>URL: <ext-link ext-link-type="uri" xlink:href="https://eluxemburgensia.lu">https://eluxemburgensia.lu</ext-link>.</p></fn>
<fn id="fn2"><p>URL: <ext-link ext-link-type="uri" xlink:href="https://github.com/natliblux/nautilusocr">https://github.com/natliblux/nautilusocr</ext-link>.</p></fn>
<fn id="fn3"><p>URL: <ext-link ext-link-type="uri" xlink:href="https://data.bnl.lu/data/historical-newspapers">https://data.bnl.lu/data/historical-newspapers</ext-link>.</p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="r1"><mixed-citation>Bradski, G. (2000). The OpenCV Library. <italic>Dr. Dobb&#x2019;s Journal of Software Tools, 25</italic>(11), 120&#x2013;125.</mixed-citation></ref>
<ref id="r2"><mixed-citation>Breuel, T. M. (2008). The OCRopus open source OCR system. <italic>Proc. SPIE 6815, Document Recognition and Retrieval XV, 68150F</italic>. Electronic Imaging 2005, San Jose, California, USA. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1117/12.783598">https://doi.org/10.1117/12.783598</ext-link></mixed-citation></ref>
<ref id="r3"><mixed-citation>Carrasco, R. C. (2014). An open-source OCR evaluation tool. <italic>Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage &#x2013; DATeCH&#x2019; 14</italic> (pp. 179&#x2013;184). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2595188.2595221">https://doi.org/10.1145/2595188.2595221</ext-link></mixed-citation></ref>
<ref id="r5"><mixed-citation>Hegghammer, T. (2022). OCR with Tesseract, Amazon Textract, and Google Document AI: A benchmarking experiment. <italic>Journal of Computational Social Science</italic>, <italic>5</italic>(1), 861&#x2013;882. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s42001-021-00149-1">https://doi.org/10.1007/s42001-021-00149-1</ext-link></mixed-citation></ref>
<ref id="r6"><mixed-citation>Kettunen, K., Koistinen, M., &#x0026; Kervinen, J. (2020). Ground truth OCR sample data of Finnish historical newspapers and journals in data improvement validation of a re-OCRing process. <italic>LIBER Quarterly, 30</italic>(1). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18352/lq.10322">https://doi.org/10.18352/lq.10322</ext-link></mixed-citation></ref>
<ref id="r7"><mixed-citation>Kiessling, B. (2019). Kraken - a universal text recognizer for the humanities. <italic>Digital Humanities Conference 2019 (DH2019)</italic>. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.34894/Z9G2EX">https://doi.org/10.34894/Z9G2EX</ext-link></mixed-citation></ref>
<ref id="r8"><mixed-citation>Levenshtein, V. (1965). Binary codes capable of correcting spurious insertions and deletions of ones. <italic>Problems of Information Transmission</italic>, <italic>1</italic>, 8&#x2013;17.</mixed-citation></ref>
<ref id="r9"><mixed-citation>Luxemburger Wort. (1942). Neues Kleid. <italic>Luxemburger Wort, 2.3.1942</italic>(61), 1. <ext-link ext-link-type="uri" xlink:href="https://persist.lu/ark:70795/g3vmw4/pages/1/articles/DTL47">https://persist.lu/ark:70795/g3vmw4/pages/1/articles/DTL47</ext-link></mixed-citation></ref>
<ref id="r10"><mixed-citation>Maurer, Y. (2017). Improving the quality of the text, a pilot project to assess and correct the OCR in a multilingual environment. <italic>Relying on News Media. Long Term Preservation and Perspectives for Our Collective Memory</italic>. <ext-link ext-link-type="uri" xlink:href="https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-164455">https://nbn-resolving.org/urn:nbn:de:bsz:14-qucosa2-164455</ext-link></mixed-citation></ref>
<ref id="r11"><mixed-citation>Neudecker, C., Baierer, K., Federbusch, M., Boenig, M., W&#x00FC;rzner, K.-M., Hartmann, V., &#x0026; Herrmann, E. (2019). OCR-D: An end-to-end open source OCR framework for historical printed documents. <italic>Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage</italic> (pp. 53&#x2013;58). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3322905.3322917">https://doi.org/10.1145/3322905.3322917</ext-link></mixed-citation></ref>
<ref id="r12"><mixed-citation>Nguyen, T. T. H., Jatowt, A., Nguyen, N.-V., Coustaty, M., &#x0026; Doucet, A. (2020). Neural machine translation with BERT for post-OCR error detection and correction. <italic>Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020</italic> (pp. 333&#x2013;336). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3383583.3398605">https://doi.org/10.1145/3383583.3398605</ext-link></mixed-citation></ref>
<ref id="r13"><mixed-citation>Schneider, P. (2021). Combining morphological and histogram based text line segmentation in the OCR Context. <italic>Journal of Data Mining &#x0026; Digital Humanities</italic>, 2021 (HistoInformatics). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.46298/jdmdh.7277">https://doi.org/10.46298/jdmdh.7277</ext-link></mixed-citation></ref>
<ref id="r14"><mixed-citation>Schneider, P., &#x0026; Maurer, Y. (2022). <italic>Rerunning OCR - A machine learning approach to quality assessment and enhancement prediction</italic>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2110.01661">https://arxiv.org/abs/2110.01661</ext-link></mixed-citation></ref>
<ref id="r15"><mixed-citation>Smith, R. (2007). An overview of the Tesseract OCR engine. <italic>Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil</italic> (pp. 629&#x2013;633). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/icdar.2007.4376991">https://doi.org/10.1109/icdar.2007.4376991</ext-link></mixed-citation></ref>
<ref id="r17"><mixed-citation>Soper, E., Fujimoto, S., &#x0026; Yu, Y.-Y. (2021). BART for post-correction of OCR newspaper text. <italic>Proceedings of the Seventh Workshop on Noisy User-Generated Text (W-NUT 2021)</italic> (pp. 284&#x2013;290). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/2021.wnut-1.31">https://doi.org/10.18653/v1/2021.wnut-1.31</ext-link></mixed-citation></ref>
<ref id="r18"><mixed-citation>The Luxembourg Government. (n.d). <italic>The AI4gov initiative</italic>. Retrieved November 2, 2022, from <ext-link ext-link-type="uri" xlink:href="https://gouvernement.lu/en/dossiers.gouv_digitalisation%2Ben%2Bdossiers%2B2021%2BAI4Gov.html">https://gouvernement.lu/en/dossiers.gouv_digitalisation%2Ben%2Bdossiers%2B2021%2BAI4Gov.html</ext-link></mixed-citation></ref>
<ref id="r19"><mixed-citation>Van de Camp, M. (2008). <italic>Explorations into unsupervised corpus quality assessment</italic> (Doctoral dissertation, Tilburg University, The Netherlands). Retrieved November 9, 2022, from <ext-link ext-link-type="uri" xlink:href="http://ilk.uvt.nl/downloads/pub/papers/hait/camp2008.pdf">http://ilk.uvt.nl/downloads/pub/papers/hait/camp2008.pdf</ext-link></mixed-citation></ref>
</ref-list>
</back>
</article>