Contract Metadata Identification in Czech Scanned Documents

Authors	HA Hien Thi, HORÁK Aleš, BUI Minh Tuan
Year of publication	2021
Type	Article in proceedings
Conference	Proceedings of the 13th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
MU Faculty / Unit	Fakulta informatiky (Faculty of Informatics)
Web	https://www.scitepress.org/PublicationsDetail.aspx?ID=Alw3hhTLE1M=&t=1
DOI	http://dx.doi.org/10.5220/0010243807950802
Keywords	Information Extraction; Scanned Documents; Document Metadata; Contract Metadata Extraction; Czech
Description	Although born-digital documents are now prevalent, the exchange of business documents often still involves processing their scanned images, a human-readable format with a one-to-one correspondence to the paper originals. Bulk processing of such scanned documents then requires human intervention to extract and enter the main document metadata. In this paper, we present the design and evaluation of a contract processing module in the OCRMiner system. The information extraction process combines layout properties with text analysis as input to rule-based extraction with confidence score propagation. The first results are evaluated on public Czech contract documents, reaching an item extraction accuracy of almost 88%.
Related projects:	LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities
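
The description mentions rule-based extraction that combines layout properties with text analysis and propagates a confidence score. The record does not say how OCRMiner implements this, so the following is only a minimal illustrative sketch under stated assumptions: the names `Line`, `Rule`, and `extract`, the sample lines, the regular expressions, and the choice to multiply OCR confidence by a rule weight are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Line:
    """One OCR'd text line (hypothetical structure, not OCRMiner's)."""
    text: str
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom
    ocr_conf: float   # OCR engine confidence in [0, 1]


@dataclass
class Rule:
    """A text pattern plus a layout condition and a trust weight."""
    field: str
    pattern: re.Pattern
    weight: float                      # assumed trust in the rule, in [0, 1]
    layout_ok: Callable[[Line], bool]  # layout property the line must satisfy


def extract(lines, rules):
    """Return {field: (value, confidence)} keeping the best candidate per field.

    Assumption: confidence is propagated by multiplying the OCR
    confidence of the line by the weight of the rule that matched.
    """
    best = {}
    for rule in rules:
        for line in lines:
            match = rule.pattern.search(line.text)
            if match is None or not rule.layout_ok(line):
                continue
            score = line.ocr_conf * rule.weight
            if rule.field not in best or score > best[rule.field][1]:
                best[rule.field] = (match.group(1), score)
    return best


# Illustrative rules: a contract number is expected near the top of the
# page; a date may appear anywhere but is trusted less.
RULES = [
    Rule("contract_number",
         re.compile(r"[Ss]mlouva\s+č\.\s*([\w/-]+)"),
         weight=0.9,
         layout_ok=lambda line: line.y < 0.3),
    Rule("date",
         re.compile(r"(\d{1,2}\.\s?\d{1,2}\.\s?\d{4})"),
         weight=0.6,
         layout_ok=lambda line: True),
]

# Made-up sample lines standing in for OCR output of a Czech contract.
LINES = [
    Line("Smlouva č. 2020/123", y=0.05, ocr_conf=0.95),
    Line("V Brně dne 12. 3. 2020", y=0.90, ocr_conf=0.80),
    Line("viz smlouva č. 999", y=0.95, ocr_conf=0.99),  # rejected by layout
]

result = extract(LINES, RULES)
for field, (value, confidence) in result.items():
    print(f"{field}: {value} (confidence {confidence:.2f})")
```

In this sketch the third line matches the contract-number pattern with a higher OCR confidence, but the layout condition rejects it because it sits at the bottom of the page. That is the kind of interaction between layout and text evidence the description refers to; a real system would likely use richer layout features and a more principled scoring scheme.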