Who is Selling to Whom – Feature Evaluation for Multi-block Classification in Invoice Information Extraction
Authors | |
---|---|
Year of publication | 2021 |
Type | Article in proceedings |
Conference | SPECOM 2021: 23rd International Conference on Speech and Computer |
MU Faculty or unit | |
Citation | |
www | https://link.springer.com/chapter/10.1007/978-3-030-87802-3_23 |
Doi | http://dx.doi.org/10.1007/978-3-030-87802-3_23 |
Keywords | OCR; Invoice; Block type classification; Seller; Buyer; Delivery address |
Description | The invoice information extraction task aims at unifying the automated processing of invoices, both in structured forms and as scanned images. Recognizing pieces of information where a specific value is identified by a keyword (such as the invoice date) is a relatively well-managed task. On the other hand, identifying multi-block information on the invoice, such as distinguishing the seller, the buyer, and the delivery address, is much more challenging because invoice layouts vary widely. In this work, we present a new technique of feature extraction and classification to recognize the seller, buyer, and delivery address text blocks in scanned invoices, based on a combination of complex layout features and annotated text features. The method considers not only the positional features of each block but also the relations between blocks and the block contents at a higher level. The technique is implemented as a module of the OCRMiner system. We offer a detailed evaluation and error analysis on a dataset of more than five hundred Czech invoices, reaching an overall macro-average F1-score of 94%. |
Related projects: