Intrinsic Methods for Comparison of Corpora
Authors | |
---|---|
Year of publication | 2013 |
Type | Article in proceedings |
Conference | RASLAN 2013 Recent Advances in Slavonic Natural Language Processing |
MU Faculty or unit | |
Citation | |
www | https://nlp.fi.muni.cz/raslan/2013/paper05.pdf |
Field | Informatics |
Keywords | text corpus; corpora comparison |
Description | Because very few techniques exist for the quantitative and systematic comparison of text corpora, we proposed and implemented several novel methods. We applied them to compare two very large web-based Czech text corpora, czTenTen12 and Hector, which contain more than 4.47 and 2.65 billion words, respectively. All methods are fully automatic, and some are language independent. We have released some of them so that they can be used immediately to compare other corpora. |
Related projects: |