Utok: The Fast Rule-based Tokenizer
Authors | |
---|---|
Year of publication | 2022 |
Type | Article in Proceedings |
Conference | Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022 |
MU Faculty or unit | |
Citation | |
Web | |
Keywords | tokenizer; tokenization; text processing |
Description | Tokenization is one of the first processing steps in most natural language processing applications. The paper introduces Utok, a new tokenizer that follows the Unitok tokenizer in keeping configuration for different languages simple, while processing text much faster. |