Words’ Burstiness in Language Models
Authors | |
---|---|
Year of publication | 2011 |
Type | Article in Proceedings |
Conference | Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 |
MU Faculty or unit | |
Citation | |
Web | https://nlp.fi.muni.cz/raslan/2011/paper17.pdf |
Field | Linguistics |
Keywords | Burstiness; Language models; Words' probability |
Description | Good estimation of the probability of a single word is a crucial part of language modelling. It is based on the raw frequency of the word in a training corpus. Such an estimate works well for function words and most very frequent words, but it is poor for most content words because of their tendency to occur in clusters. This paper provides an analysis of words' burstiness and proposes a new unigram language model that handles bursty words much better. Evaluation on two data sets shows that the new model achieves consistently lower perplexity and cross-entropy. |
Related projects | |
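
The description refers to the standard raw-frequency (maximum-likelihood) unigram estimate and to evaluation by cross-entropy and perplexity. The sketch below illustrates that baseline and one common way to quantify burstiness (the variance-to-mean ratio of a word's per-document counts); it is not the model proposed in the paper, and the function names and toy corpora are illustrative assumptions only.

```python
from collections import Counter
from math import log2


def mle_unigram(tokens):
    """Raw-frequency (maximum-likelihood) unigram estimate: P(w) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cross_entropy(model, tokens, unk_prob=1e-7):
    """Per-word cross-entropy in bits; unseen words get a small floor probability."""
    return -sum(log2(model.get(w, unk_prob)) for w in tokens) / len(tokens)


def perplexity(model, tokens, unk_prob=1e-7):
    """Perplexity is 2 raised to the per-word cross-entropy."""
    return 2 ** cross_entropy(model, tokens, unk_prob)


def burstiness(docs, word):
    """Variance-to-mean ratio of the word's per-document counts.

    Values well above 1 indicate clustering (burstiness); a value near 1
    is what a Poisson-like, non-bursty word would show.
    """
    counts = [doc.count(word) for doc in docs]
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean


if __name__ == "__main__":
    # Toy data, purely for illustration of the baseline computations.
    train = "the cat sat on the mat and the dog sat on the rug".split()
    test = "the dog and the cat sat on the mat".split()
    model = mle_unigram(train)
    print(f"cross-entropy: {cross_entropy(model, test):.3f} bits/word")
    print(f"perplexity:    {perplexity(model, test):.3f}")
```

On a real corpus split into documents, comparing `burstiness(docs, w)` for a function word (e.g. "the") and a content word typically shows the content word's counts are far more overdispersed, which is the effect the raw-frequency estimate ignores.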