Utok: The Fast Rule-based Tokenizer

Authors	RYCHLÝ Pavel; ŠPALEK Samuel
Year of publication	2022
Type	Article in Proceedings
Conference	Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022
MU Faculty or unit	Faculty of Informatics
Web	Full text; Workshop home page
Keywords	tokenizer; tokenization; text processing
Description	Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that, like the Unitok tokenizer, is simple to configure for different languages, and that processes text much faster.
Related projects:	LINDAT/CLARIAH-CZ - Digital Research Infrastructure for Language Technologies, Arts and Humanities