Character-based Language Model

Authors	BAISA Vít
Year of publication	2014
Type	Article in Proceedings
Conference	Eighth Workshop on Recent Advances in Slavonic Natural Language Processing
MU Faculty or unit	Faculty of Informatics
Web	https://nlp.fi.muni.cz/raslan/2014/6.pdf
Field	Linguistics
Keywords	language model; suffix array; LCP; trie; character-based; random text generator; corpus
Description	Language modelling and other natural language processing tasks are usually based on words. Here I present a more general yet simpler approach to language modelling that uses much smaller units of text: a character-based language model (CBLM). In this paper I describe the underlying data structure of the model and evaluate the model using standard measures (entropy and perplexity). As a proof of concept and an extrinsic evaluation, I also present a random sentence generator based on this model.
Related projects	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure
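
The description above names three ingredients: a model over characters, evaluation by entropy and perplexity, and a random text generator. The following is a minimal sketch of those ideas only, not the data structure from the paper: it swaps the suffix array, LCP and trie listed in the keywords for a plain dictionary of fixed-length contexts with add-one smoothing. The class name `CharNGramModel`, its methods, the `order` parameter and the start-of-text padding character are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


class CharNGramModel:
    """Toy fixed-order character model (illustrative, not the paper's CBLM)."""

    PAD = "\x02"  # hypothetical start-of-text padding character

    def __init__(self, text, order=3):
        if order < 1:
            raise ValueError("order must be at least 1")
        self.order = order
        self.alphabet = sorted(set(text))
        # Maps each context of `order` characters to counts of the next character.
        self.counts = defaultdict(Counter)
        padded = self.PAD * order + text
        for i in range(order, len(padded)):
            self.counts[padded[i - order:i]][padded[i]] += 1

    def prob(self, context, char):
        """Add-one smoothed P(char | context), with one extra slot for unseen characters."""
        seen = self.counts.get(context, Counter())
        return (seen[char] + 1) / (sum(seen.values()) + len(self.alphabet) + 1)

    def entropy(self, text):
        """Cross-entropy of `text` under the model, in bits per character."""
        padded = self.PAD * self.order + text
        log_sum = 0.0
        for i in range(self.order, len(padded)):
            log_sum += math.log2(self.prob(padded[i - self.order:i], padded[i]))
        return -log_sum / len(text)

    def perplexity(self, text):
        """Perplexity is 2 raised to the cross-entropy in bits."""
        return 2 ** self.entropy(text)

    def generate(self, length, seed=0):
        """Sample `length` characters, each weighted by its observed count in context."""
        rng = random.Random(seed)
        context = self.PAD * self.order
        out = []
        for _ in range(length):
            seen = self.counts.get(context)
            if not seen:
                # Unseen context: fall back to a uniform draw over the alphabet.
                char = rng.choice(self.alphabet)
            else:
                chars = list(seen)
                char = rng.choices(chars, weights=[seen[c] for c in chars])[0]
            out.append(char)
            context = (context + char)[-self.order:]
        return "".join(out)
```

A possible use, under the same assumptions: build `CharNGramModel(corpus_text, order=3)`, call `entropy(held_out_text)` and `perplexity(held_out_text)` for the intrinsic measures, and call `generate(80)` for a random character sequence. A fixed context length keeps the sketch short; a suffix-array or trie-based structure, as the keywords suggest the paper uses, would allow contexts of varying length.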