IEEHR 2017-The esposalles
OCR/Text Detection
License: CC BY-NC-ND 4.0


The extraction of relevant information from historical handwritten document collections is a key step in making these manuscripts accessible and searchable.
In this context, the objective is to move beyond pure transcription towards document understanding. Concretely, the aim is to detect the named entities and assign each of them a semantic category, such as family names, places, or occupations.
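The task described above can be pictured as attaching a category label to each transcribed word. The following is only a minimal sketch under stated assumptions: the `TaggedWord` class, the `group_entities` helper, the label names, and the example words are all hypothetical illustrations, not the competition's actual annotation format or data.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical label set: the text only mentions names, places and
# occupations; "other" is an assumed catch-all for non-entity words.
CATEGORIES = {"name", "family_name", "place", "occupation", "other"}


@dataclass(frozen=True)
class TaggedWord:
    """One transcribed word together with its semantic category."""
    text: str
    category: str


def group_entities(words):
    """Collect the words of each semantic category, skipping 'other'."""
    grouped = defaultdict(list)
    for word in words:
        if word.category not in CATEGORIES:
            raise ValueError(f"unknown category: {word.category}")
        if word.category != "other":
            grouped[word.category].append(word.text)
    return dict(grouped)


# Invented example words, not taken from the dataset.
record = [
    TaggedWord("Joan", "name"),
    TaggedWord("Ferrer", "family_name"),
    TaggedWord("pages", "occupation"),
    TaggedWord("de", "other"),
    TaggedWord("Barcelona", "place"),
]
entities = group_entities(record)
```

Grouping by category in this way is one simple route from tagged words to database fields that can then be searched.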
A typical application scenario for named entity recognition is demographic documents, since they contain people's names, birthplaces, occupations, etc. In this scenario, extracting the key contents and storing them in databases gives access to those contents and makes it possible to envision innovative services based on genealogical, social or demographic searches.
Lately, the interest of the document image analysis community in document understanding, named entity recognition and semantic categorization has been growing, and some techniques based on HMMs, BLSTMs and CNNs have been proposed. With this competition, we aim to foster research in this field and offer a benchmark for the research community.

Data Collection

This database consists of historical handwritten marriage records from the Archives of the Cathedral of Barcelona. The pages we used correspond to volume 69, written in old Catalan by a single writer in the 17th century. Each marriage record contains information about the husband's occupation, place of origin, the husband's and wife's former marital status, the parents' occupations, place of residence, geographical origin, etc.

Data Summary
Provided by
Computer Vision Center (CVC)
The CVC is a non-profit research center with an independent legal status, established in 1995 by the Generalitat de Catalunya and the Universitat Autònoma de Barcelona (UAB).