Brown Bag Lunch: @SUB

, 12:30 pm to

Vortragsraum (lecture room), SUB/Papendiek 14. Talks: Eliese-Sophia Lincke, research fellow at the Campus Lab, on "Digitizing Coptic Texts", and Kirill Bulert, research assistant at the eTrap group, on "Text reuse detection with TRACER"

Eliese-Sophia Lincke: 
Digitizing Coptic Texts

Digital Coptology is booming, and Natural Language Processing is a vital part of it, both in individual research and in the database and corpus building of larger projects. The first step in this process is text digitization. Until now, Coptic texts had to be retyped manually. Optical Character Recognition (OCR) is a promising method that will facilitate and accelerate the digitization of Coptic texts. But turnkey OCR programs like ABBYY FineReader or OmniPage are not adapted to Coptic, i.e. they cannot recognize the Coptic Unicode characters. Trainable OCR programs like Ocropus (also called Ocropy) or Tesseract, however, can be trained to extract Coptic text from image files (or PDFs). Tests with these programs have been run by myself as well as by others, with promising results (Kirill Bulert, So Miyagawa and Marco Büchler 2017 for Ocropus; Moheb S. Mekhaiel for Tesseract). My own preliminary tests have shown that Ocropus reaches a satisfactory accuracy rate (from above 98% to above 99%).
During my fellowship, I will continue the work on Coptic character recognition and text extraction, mainly with Ocropus. The aim is to produce OCR models and to make them available online (together with, where copyright law permits, text files of the OCR’ed texts). This will enable Coptic scholars to extract and use text files for their research without having to repeat the whole training process themselves.
My talk at the Brown Bag Lunch will show how training Ocropus and extracting text with this tool works. (Previous knowledge of Coptic is not required, of course ;-) )
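For readers unfamiliar with how an OCR accuracy rate is typically obtained, here is a minimal sketch. It assumes a character-level measure: one minus the edit distance between the OCR output and a manually checked ground-truth transcription, divided by the length of the ground truth. The abstract does not say which measure was used, so this is an illustration and not the speaker's evaluation procedure. The function names are hypothetical, and nothing here is part of Ocropus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed measure: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    # Made-up sample strings; one wrong character out of four.
    print(character_accuracy("abcd", "abxd"))  # 0.75
```

Note that Python strings are compared per Unicode code point, so a Coptic letter with a combining mark (such as a supralinear stroke) counts as two characters in this sketch.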
 



Kirill Bulert: 
Text reuse detection with TRACER — Scientific findings and current development report

Abstract: The search for text reuse is not only about pure plagiarism,
but also about being able to detect paraphrases and semantic reuse.
Although scholars have been trying to find reuse in books for hundreds
of years, the task is far from simple, especially if the texts span
hundreds of years or cross cultural borders. TRACER is software that
was developed to detect text reuse inside [...] text reuse also
includes [...] The search for text reuse inside a corpus is not an easy task.
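
To make the difficulty concrete, here is a minimal sketch of a naive baseline: comparing two passages by the overlap of their word trigrams (Jaccard similarity). This is not TRACER's method, which the abstract does not describe. The function names are hypothetical and the example sentences are made up for illustration.

```python
import re


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams (as tuples) from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_score(text_a: str, text_b: str, n: int = 3) -> float:
    """Naive reuse score: Jaccard similarity of the two texts' word n-grams."""
    return jaccard(shingles(text_a, n), shingles(text_b, n))


if __name__ == "__main__":
    original = "In the beginning God created the heaven and the earth"
    paraphrase = "God made the sky and the land at the start"
    print(reuse_score(original, original))    # 1.0: verbatim reuse is found
    print(reuse_score(original, paraphrase))  # 0.0: the paraphrase is missed
```

The paraphrase shares no trigram with the original, so a purely verbatim comparison scores it at zero. Paraphrase and semantic reuse therefore call for more than surface matching.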