ICDAR2019 Post-OCR Text Correction
Text
OCR/Text Detection | NLP
License: Custom

Overview

This corpus consists of OCRed documents in 10 European languages, totaling about 20M characters (3.5M tokens), aligned with their corresponding Gold Standard (ground truth). Each language contains one or more sub-folders (unbalanced in size), organized by the source of the collected data, as follows:
Dataset details : partitioning md-1

The same details are available in the original Excel form. Each training file contains three blocks, structured as follows. Note that only the first block, [OCR_output], is included in the test set.
md-2
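
As an illustration of reading such a file, here is a minimal Python sketch. It assumes each block sits on its own line and begins with a bracketed label followed by the text. Only the label [OCR_output] is named in the description above. The other two labels in the sample, and the helper name parse_blocks, are hypothetical placeholders and not part of the dataset specification.

```python
import re

# Assumed layout: one block per line, starting with a bracketed label,
# e.g. "[OCR_output] some text". This layout is an assumption.
BLOCK_RE = re.compile(r"^\[\s*([^\]]+?)\s*\]\s?(.*)$")


def parse_blocks(text):
    """Return a dict mapping each block label to the text that follows it."""
    blocks = {}
    for line in text.splitlines():
        match = BLOCK_RE.match(line)
        if match:
            label, content = match.groups()
            blocks[label] = content
    return blocks


# Hypothetical sample: the second and third labels are placeholders.
sample = (
    "[OCR_output] Tbe quick brovvn fox\n"
    "[hypothetical_block_2] Tbe quick brovvn fox\n"
    "[hypothetical_block_3] The quick brown fox\n"
)

blocks = parse_blocks(sample)
ocr_text = blocks["OCR_output"]  # the only block present in test files
```

Because the parser looks blocks up by label instead of by position, the same function can read a test file that contains only the first block.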

Citation

@inproceedings{rigaud2019pocr,
 title="ICDAR 2019 Competition on Post-OCR Text Correction",
 author={Rigaud, Christophe and Doucet, Antoine and Coustaty, Mickael and Moreux, Jean-Philippe},
 year={2019},
 booktitle={Proceedings of the 15th International Conference on Document Analysis and Recognition (2019)}
 }

License

Custom

Data Summary
Type: Text
Amount: --
Size: --
Provided by: ICDAR 2019
ICDAR is a flagship conference series and the largest and premier international gathering of researchers, scientists, and practitioners in the document analysis community.