Intrinsic Methods for Comparison of Corpora
Authors | |
---|---|
Year of publication | 2013 |
Type | Article in Proceedings |
Conference | RASLAN 2013 Recent Advances in Slavonic Natural Language Processing |
MU Faculty or unit | |
Citation | |
Web | https://nlp.fi.muni.cz/raslan/2013/paper05.pdf |
Field | Informatics |
Keywords | text corpus; corpora comparison |
Description | Because very few techniques exist for the quantitative and systematic comparison of text corpora, we proposed and implemented several novel methods. We applied them to compare two very large web-based Czech text corpora, czTenTen12 and Hector, which contain more than 4.47 and 2.65 billion words, respectively. All methods are fully automatic, and some of them are language-independent. We have released some of them so that they can be used immediately to compare other corpora. |