References

Berg-Kirkpatrick, T., & Klein, D. (2014). Improved typesetting models for historical OCR. In K. Toutanova & H. Wu (Eds.), Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (pp. 118–123). https://doi.org/10.3115/v1/P14-2020 .

Carrasco, R. C. (2014). An open-source OCR evaluation tool. In A. Antonacopoulos & K. U. Schulz (Eds.), Proceedings of the first international conference on digital access to textual cultural heritage (DATeCH ’14) (pp. 179–184). New York: ACM. https://doi.org/10.1145/2595188.2595221 .

Choudhury, M., Thomas, M., Mukherjee, A., Basu, A., & Ganguly, N. (2007). How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach. In C. Biemann, I. Matveeva, R. Mihalcea, & D. Radev (Eds.), TextGraphs-2: Graph-based algorithms for natural language processing – Proceedings of the workshop (HLT-NAACL 2007) (pp. 81–88). New Brunswick, NJ: Association for Computational Linguistics. https://arxiv.org/pdf/physics/0703198.pdf .

Dashti, S. M. (2018). Real-word error correction with trigrams: Correcting multiple errors in a sentence. Language Resources and Evaluation, 52, 485–502. https://doi.org/10.1007/s10579-017-9397-4 .

Drobac, S., Kauppinen, P., & Lindén, K. (2017). OCR and post-correction of historical Finnish texts. In J. Tiedemann (Ed.), NoDaLiDa, Proceedings of the 21st Nordic conference on computational linguistics (pp. 70–76). Linköping: Linköping University Electronic Press. Retrieved January 23, 2020, from https://www.aclweb.org/anthology/W17-0209.pdf .

Dunning, A. (2012). European newspaper survey report. Retrieved January 23, 2020, from http://www.europeana-newspapers.eu/wp-content/uploads/2012/04/D4.1-Europeana-newspapers-survey-report.pdf .

Ghosh, K., Chakraborty, A., Parui, S. K., & Majumder, P. (2016). Improving information retrieval performance on OCRed text in the absence of clean text ground truth. Information Processing and Management, 52(5), 873–884. https://doi.org/10.1016/j.ipm.2016.03.006 .

Järvelin, A., Keskustalo, H., Sormunen, E., Saastamoinen, M., & Kettunen, K. (2016). Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. Journal of the Association for Information Science and Technology, 67(12), 2928–2946. https://doi.org/10.1002/asi.23379 .

Kettunen, K. (2016). Keep, change or delete? Setting up a low resource OCR postcorrection framework for a digitized old Finnish newspaper collection. In D. Calvanese, D. De Nart, & C. Tasso (Eds.), Digital libraries on the move (IRCDL 2015) (Vol. 612, pp. 95–103). Communications in Computer and Information Science. Cham, CH: Springer. https://doi.org/10.1007/978-3-319-41938-1_11 .

Kettunen, K., & Pääkkönen, T. (2016). Measuring lexical quality of a historical Finnish newspaper collection – Analysis of garbled OCR data with basic language technology tools and means. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, …, S. Piperidis (Eds.), Proceedings of the tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 956–961). Retrieved January 23, 2020, from http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf .

Kettunen, K., Pääkkönen, T., & Koistinen, M. (2016). Between diachrony and synchrony: Evaluation of lexical quality of a digitized historical Finnish newspaper and journal collection with morphological analyzers. In I. Skadiņa & R. Rozis (Eds.), Human language technologies – The Baltic perspective. Proceedings of the seventh international conference (Baltic HLT 2016) (pp. 122–129). Amsterdam: IOS Press. Retrieved January 23, 2020, from http://ebooks.iospress.nl/volume/human-language-technologies-the-baltic-perspective-proceedings-of-the-seventh-international-conference-baltic-hlt-2016 .

Kettunen, K., & Koistinen, M. (2019). Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and early 20th century newspapers and journals – Collected notes on quality improvement. In C. Navarretta, M. Agirrezabal, & B. Maegaard (Eds.), Proceedings of the Digital Humanities in the Nordic countries 4th conference (DHN2019) (pp. 270–282). Retrieved January 23, 2020, from http://ceur-ws.org/Vol-2364/25_paper.pdf .

Klein, S. T., & Kopel, M. (2002). A voting system for automatic OCR correction. In J. Callan, P. Kantor, & D. Grossman (Eds.), Proceedings of the SIGIR 2002 Workshop on information retrieval and OCR: From converting content to grasping meaning (n.p.). Retrieved January 23, 2020, from http://boston.lti.cs.cmu.edu/callan/Workshops/IR-OCR-02/tklein.pdf .

Koistinen, M., Kettunen, K., & Pääkkönen, T. (2017). Improving optical character recognition of Finnish historical newspapers with a combination of Fraktur & Antiqua models and image preprocessing. In J. Tiedemann (Ed.), NoDaLiDa, Proceedings of the 21st Nordic conference on computational linguistics (pp. 277–283). Linköping: Linköping University Electronic Press. Retrieved January 23, 2020, from http://www.ep.liu.se/ecp/131/038/ecp17131038.pdf .

Koistinen, M., Kettunen, K., & Kervinen, J. (2018). Bad OCR has a nasty character – re-OCRing historical Finnish newspaper material 1771–1910. Submitted to International Journal of Document Recognition and Analysis.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

Lopresti, D. (2009). Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition, 12, 141–151. https://doi.org/10.1007/s10032-009-0094-8 .

Mäkelä, E. (2016). LAS: An integrated language analysis tool for multiple languages. The Journal of Open Source Software, 1(6):35, 1–2. https://doi.org/10.21105/joss.00035 .

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: The MIT Press.

Märgner, V., & El Abed, H. (2014). Tools and metrics for document analysis system evaluation. In D. Doermann & K. Tombre (Eds.), Handbook of document image processing and recognition (pp. 1011–1036). London: Springer Verlag.

Pääkkönen, T., Kervinen, J., Nivala, A., Kettunen, K., & Mäkelä, E. (2016). Exporting Finnish digitized historical newspaper contents for offline use. D-Lib Magazine, 22(7/8), n.p. https://doi.org/10.1045/july2016-paakkonen .

Pletschacher, S., Clausner, C., & Antonacopoulos, A. (2015). Europeana newspapers OCR workflow evaluation. In B. Coüasnon, V. Märgner, V. Frinken, & B. Barrett (Eds.), HIP ’15, Proceedings of the 3rd International workshop on historical document imaging and processing (pp. 39–46). New York: ACM Digital Library. https://doi.org/10.1145/2809544.2809554 .

Reynaert, M. (2016). OCR Post-correction evaluation of early Dutch books online – Revisited. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, …, S. Piperidis (Eds.), Proceedings of the tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 967–974). Retrieved January 23, 2020, from https://pure.uvt.nl/ws/portalfiles/portal/14518959/LREC2016.EDBOeval.FinalSubmittedVersion.redownloaded20160318.pdf .

Silfverberg, M., Kauppinen, P., & Lindén, K. (2016). Data-driven spelling correction using weighted finite-state methods. In B. Jurish, A. Maletti, U. Springmann, & K.-M. Würzner (Eds.), Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (pp. 51–59). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved January 23, 2020, from https://aclweb.org/anthology/W/W16/W16-2406.pdf .

The “State of the Art”: A Comparative analysis of newspaper digitization to date (2015). Retrieved January 23, 2020, from https://www.crl.edu/sites/default/files/d6/attachments/events/ICON_Report-State_of_Digitization_final.pdf .

Tanner, S., Muñoz, T., & Ros, P. H. (2009). Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British Library’s 19th Century Online Newspaper Archive. D-Lib Magazine, 15(8), n.p. https://doi.org/10.1045/july2009-munoz .

Traub, M. C., Samar, T., Ossenbruggen, J. van, He, J., Vries, A. de, & Hardman, L. (2016). Querylog-based assessment of retrievability bias in a large newspaper corpus. In J. S. Downie & R. H. McDonald (Eds.), JCDL ’13, 13th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 7–16). New York: ACM. https://doi.org/10.1145/2910896.2910907 .

Volk, M., Furrer, L., & Sennrich, R. (2011). Strategies for reducing and correcting OCR errors. In C. Sporleder, A. van den Bosch, & K. Zervanou (Eds.), Language technology for cultural heritage: Selected papers from the LaTeCH workshop series (pp. 3–22). Berlin: Springer. https://doi.org/10.1007/978-3-642-20227-8_1 .