The ground-truthed datasets of PDF tables
License: Unknown

Overview

This page provides two ground-truthed datasets of natively digital PDF documents containing tables. The documents were collected systematically from European Union and US Government websites, so we expect them to be in the public domain. Each PDF document is accompanied by three XML (or CSV) files containing its ground truth in the following models:

  • table regions (for evaluating table location)
  • cell structures (for evaluating table structure recognition)
  • functional representation (for evaluating table interpretation)
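
As a rough illustration of how the table-region ground truth might be used to score table location, the sketch below parses a small region file and compares one ground-truth box with a predicted box. The XML element and attribute names (`document`, `table`, `region`, `bounding-box`, `x1`, `y1`, `x2`, `y2`), the sample coordinates and the helper functions `load_regions` and `iou` are all assumptions made for this example, not the dataset's documented schema. Intersection over union is used as a common overlap measure and is not necessarily the metric the dataset authors apply.

```python
import xml.etree.ElementTree as ET

# Hypothetical region ground truth. The element and attribute names are
# assumptions for illustration, not the dataset's documented schema.
SAMPLE_XML = """
<document filename="example.pdf">
  <table id="1">
    <region id="1" page="1">
      <bounding-box x1="50" y1="100" x2="300" y2="400"/>
    </region>
  </table>
</document>
"""


def load_regions(xml_text):
    """Return a list of (page, (x1, y1, x2, y2)) tuples from the assumed schema."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iter("region"):
        box = region.find("bounding-box")
        coords = tuple(float(box.get(key)) for key in ("x1", "y1", "x2", "y2"))
        regions.append((int(region.get("page")), coords))
    return regions


def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = load_regions(SAMPLE_XML)
predicted = (50.0, 100.0, 300.0, 350.0)  # made-up detector output
score = iou(truth[0][1], predicted)
print(f"page {truth[0][0]}: IoU = {score:.3f}")
```

The cell-structure and functional ground truth would need their own parsers. Check the element names against the actual files before reusing this sketch.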

Data Summary

Type: Text
Amount: --
Size: --
Provided by: Dr. Tamir Hassan
A researcher, developer and consultant in the field of Document Engineering with over 15 years of experience working with PDF and HTML (+CSS+JS) documents, on topics including table recognition, automatic tagging, accessibility, layout optimization and conversion between the two formats.