Words’ Burstiness in Language Models

Authors	RYCHLÝ Pavel
Year of publication	2011
Type	Article in Proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit	Faculty of Informatics
Citation
web	https://nlp.fi.muni.cz/raslan/2011/paper17.pdf
Field	Linguistics
Keywords	Burstiness; Language models; Words' probability
Description	A good estimate of the probability of a single word is a crucial part of language modelling. This estimate is based on the raw frequency of the word in a training corpus. Raw frequency gives a good estimate for function words and most very frequent words, but a poor one for most content words, because such words tend to occur in clusters. This paper analyses words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation on two data sets shows consistently lower perplexity and cross-entropy for the new model.
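The raw-frequency baseline and the two evaluation measures named in the description can be sketched as follows. This is only an illustration of a standard unigram model, not the burstiness-aware model proposed in the paper. The function names (`train_unigram`, `cross_entropy`, `perplexity`) and the add-alpha smoothing are assumptions introduced here so that unseen words get a non-zero probability.

```python
import math
from collections import Counter


def train_unigram(tokens, alpha=1.0):
    """Return a function word -> probability estimated from raw frequency.

    Add-alpha smoothing is an assumption of this sketch; it reserves one
    extra vocabulary slot for all unseen words.
    """
    counts = Counter(tokens)
    total = len(tokens)
    vocab_size = len(counts) + 1  # +1 slot for unseen words

    def prob(word):
        return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

    return prob


def cross_entropy(prob, tokens):
    """Average negative log2 probability per token, in bits."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)


def perplexity(prob, tokens):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2.0 ** cross_entropy(prob, tokens)


if __name__ == "__main__":
    train = "the cat sat on the mat".split()
    test = "the cat sat on the sofa".split()
    model = train_unigram(train)
    print("cross-entropy (bits):", cross_entropy(model, test))
    print("perplexity:", perplexity(model, test))
```

Such a model assigns a word the same probability at every position, regardless of whether it has already appeared nearby, which is why it fits clustered content words poorly.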
Related projects:	Legal e-dictionary - PES; Temporální aspekty znalostí a informací (Temporal aspects of knowledge and information)